Apache Data Lakehouse Weekly: June 4 to June 11, 2026

The Apache Iceberg community debated a v2 loadTable endpoint for the REST catalog protocol this week, with Ryan Blue proposing the change to support optional locations and snapshots while moving credentials out of properties. Christian Thiel of the Lakekeeper project pushed back against mandatory client failure semantics for unsupported restrictions, arguing strict failure creates adoption friction. The discussion, which also considered an X-Iceberg-Client-Capabilities header, will shape how REST catalogs handle capability negotiation going forward.

The lakehouse community spent this week arguing about versions, and the arguments mattered. Parquet contributors produced the single largest thread across all five projects with a 40-message debate on what Parquet versioning should even mean, while Iceberg shipped four release candidates of its C++ implementation in seven days and locked in a patch release plan for its two production lines. Underneath the release activity, a quieter theme connected everything: how these projects make decisions. Polaris debated merge button mechanics and HTTP status codes, Parquet contributors insisted that working group syncs cannot replace mailing list consensus, and Arrow wrote down rules for AI-generated code reviews. The formats are maturing, and so is the governance around them.

Before getting into each project, the raw numbers set the scene. The five dev lists combined for 358 emails this week. Iceberg led with 135 emails across 34 threads from 51 distinct participants, followed by Polaris at 114 emails across 23 threads from a tight group of 14 regulars. Parquet concentrated 72 emails into only 7 threads, which tells you its conversations ran deep rather than wide. Arrow posted 24 emails across 11 threads from 18 participants, and DataFusion rounded things out at 13 emails across 6 threads.

The shape of those numbers matters as much as the totals. Iceberg's breadth reflects a project with a dozen parallel workstreams from spec evolution to language implementations to community events. Polaris's depth from a small group reflects a project where a core team is hammering out operational fundamentals. And Parquet's concentration reflects a community wrestling with a handful of existential questions all at once.

The Iceberg dev list logged 135 emails across 34 threads from 51 participants this week, and the headline work happened in the spec.
Ryan Blue's vote to add a draft bitmap spec to git https://lists.apache.org/thread/hkj1tx8vwnsncdp11czzqsv5pbwds4h4 drew 14 messages and broad support, with binding +1s from Amogh Jahagirdar and others, plus non-binding approval from Micah Kornfield, who left clarity comments for implementers. The bitmap format targets small bitmaps, and the discussion surfaced a practical wrinkle worth watching. Péter Váry supported the move but flagged that delete vectors will need good compression if the community wants to store them in metadata files. Kornfield also asked Ryan a sharp process question: given the limited nature of the vote, what are the decision factors for actually promoting the draft to a finalized spec? That question echoes through several other Iceberg threads this week, because the project is increasingly comfortable landing draft specifications in git and iterating in the open rather than perfecting documents in Google Docs first. The most consequential design debate centered on the REST catalog protocol. A discussion on adding an X-Iceberg-Client-Capabilities header to the REST spec https://lists.apache.org/thread/mmb0bb8sj12tj64swv8pmm10pqgrpo3c evolved into a full conversation about a v2 loadTable endpoint. Ryan Blue laid out the case for v2, including optional locations, optional snapshots, and moving credentials out of properties. Russell Spitzer agreed those are good reasons but questioned whether a v2 endpoint actually changes the capability negotiation problem the header was meant to solve. The sharpest pushback came from Christian Thiel of the Lakekeeper project, who challenged the sentiment that a v2 loadTable should mandate that clients fail when they encounter unsupported restrictions. His argument is grounded in adoption reality: a v2 endpoint gets adopted for many reasons, and strict failure semantics create friction for clients that have nothing to do with the restriction features. 
Kurtis Wright backed the v2 direction after missing the original community meeting discussion. This thread is the one to follow if you build or operate REST catalogs, because the outcome shapes how every engine negotiates features with every catalog for years. Step back and the stakes become clearer. The REST catalog spec is now the contract that binds the entire commercial Iceberg ecosystem together. Every managed catalog service, every query engine, and every standalone tool implements some slice of it, and those slices increasingly diverge in subtle ways. A capabilities header gives clients a standard way to declare what they understand, which lets catalogs make informed decisions about what to return. A v2 loadTable goes further by fixing accumulated design debt in the most heavily trafficked endpoint in the protocol. The tension Thiel identified is the classic protocol evolution dilemma: strict semantics protect correctness for new features like fine grained access control, where a client silently ignoring a row filter is a security incident, but strictness also slows adoption by punishing clients for capabilities unrelated to their workload. How the community threads that needle will determine whether v2 arrives as a clean upgrade path or a compatibility minefield. The fact that catalog implementers like Thiel, engine maintainers like Spitzer, and spec authors like Blue are all in the same thread arguing in good faith is the system working as designed. Prashant Singh's summary of the dedicated sync on finer grained read restrictions https://lists.apache.org/thread/13bzvwqmc4nj64qdo282lsr5t5w51r99 connects directly to that capabilities debate. The room landed on capabilities handling as a core piece of the fine grained access control design, and Singh posted the recording and an AI-assisted summary for those who could not attend. 
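
The strict-versus-lenient tension can be made concrete with a small sketch. It is an illustration of the trade-off, not a design anyone in the thread proposed. Only the header name comes from the discussion; the comma-separated value format, the capability names, and the helper names are assumptions made up for this example.

```python
from dataclasses import dataclass, field

# Header name comes from the mailing list thread. The comma-separated value
# format and the capability names used below are assumptions for illustration.
CAPABILITIES_HEADER = "X-Iceberg-Client-Capabilities"


@dataclass
class TableEntry:
    """A hypothetical catalog-side record for one table."""
    name: str
    # Restrictions attached to this table, e.g. {"row-filters"}.
    required_capabilities: set = field(default_factory=set)


def parse_capabilities(headers: dict) -> set:
    """Parse the (assumed) comma-separated capabilities header."""
    raw = headers.get(CAPABILITIES_HEADER, "")
    return {item.strip() for item in raw.split(",") if item.strip()}


def load_table(table: TableEntry, headers: dict) -> str:
    """Refuse only when the requested table needs something the client lacks.

    Tables without restrictions load for every client, so workloads that
    never touch a restricted table are not penalized.
    """
    missing = table.required_capabilities - parse_capabilities(headers)
    if missing:
        return f"reject: client lacks {sorted(missing)}"
    return "ok"


plain = TableEntry("db.events")
restricted = TableEntry("db.customers", {"row-filters"})
old_client = {}  # sends no capabilities header at all
new_client = {CAPABILITIES_HEADER: "row-filters, column-masks"}

print(load_table(plain, old_client))
print(load_table(restricted, old_client))
print(load_table(restricted, new_client))
```

In this sketch the old client is turned away only from the table that carries a row filter, which is the case where silently ignoring the restriction would be a security problem.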
Sung Yun extended the FGAC conversation with a thoughtful post on a write-path gap for field-id-bound policies during schema evolution https://lists.apache.org/thread/cdph7m2hq5kmgfj5tq55o14nr31cynd3 . The read side of the proposal binds row filters and masks to field IDs so they survive schema evolution safely, but Yun points out that the write path has no equivalent story yet. Securing reads while leaving writes unguarded is a half-finished lock, so expect this gap to get attention as the proposal matures. Security work continued on a second front. Adam Szita published a spec proposal for KMS credential vending https://lists.apache.org/thread/fm09lcpc6q13hyfont47tvflfy9w9n7j through the REST catalog, separating credential management for KMS and Vault systems from the broader table encryption discussion. The intent is to let catalogs vend KMS credentials the same way they vend storage credentials today, which would make table-level encryption practical in multi-engine deployments where distributing key access manually does not scale. On the release front, Amogh Jahagirdar kicked off planning for 1.11.1 and 1.10.3 patch releases https://lists.apache.org/thread/k5pfk4rork0mp2p303pd70q9nx0tsl9w after encountering a bug where the Spark rewrite manifests procedure fails to carry over first row IDs correctly. The thread gathered 13 messages and quick consensus. Steven Wu pointed to the existing 1.11.1 milestone, Yufei Gu and Daniel Weeks added their support, and Weeks made the operating principle explicit: keep the 1.10 backports narrow so the release stays easy and helps anyone who has not yet moved forward. Meanwhile Neelesh Salian opened planning for Apache Iceberg 1.12.0 https://lists.apache.org/thread/w5qnj9gnl4rjhjnzyxlsbdzjx3kw9j8q with a direct acknowledgment that 1.11.0 took roughly eight months from 1.10.0, longer than the project wants. 
Steven Wu's response captured the philosophy the community is converging on: with a regular release habit, nobody needs to hold the release train for their feature, because the next train leaves in two to three months. Salian also posted the conclusion of the Iceberg 1.11 feature branch retrospective https://lists.apache.org/thread/5opjppof69rq9f2lpxjt410667s8hc24 in a crossover thread on the Polaris list, where Alexandre Dutra summarized the community's honest feedback by recommending that the feature branch experiment not be repeated.

The C++ implementation provided the week's endurance story. Junwang Zhao proposed RC0 of Apache Iceberg C++ 0.3.0 https://lists.apache.org/thread/vo8wfvp4cncggng31c5l6ksh7nnv1bsm on June 6, and what followed was a sprint through RC1 https://lists.apache.org/thread/fkvc8wmzokym5hmtv8gb3w6b9k8fgbp9 , RC2 https://lists.apache.org/thread/x5y5h0yk8vwzgm7r5b78o0tmp442hyky , and RC3 https://lists.apache.org/thread/8b145f8kw1mwy0ggn9jyfj6khbkb70l3 by June 11. Each candidate fixed issues the previous one surfaced. Matt Topol's RC2 verification caught real gaps in the release tooling, including undocumented meson and gtest requirements and an SSL workaround needed for the curl dependency, and Gang Wu called for improving the release script to catch similar issues automatically. By RC3, verification reports were coming in clean from macOS and Ubuntu environments across multiple contributors including Steven Wu, Raúl Cumplido, and Tanmay Rauth. Four release candidates in a week is not a failure story. It is what a healthy verification culture looks like when a young implementation is still hardening its release process.

Spec precision got its own dedicated attention. Andrei Tserakhau called a vote to clarify that the day partition transform's result type is date https://lists.apache.org/thread/gz432tvboxvno2v7g3l17c8tbtxckxrb in the spec, gathering ten messages of support including binding +1s from Matt Topol and others within hours.
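
To see why the result type needs pinning down, here is a minimal sketch of a day transform, assuming the usual definition of days counted from 1970-01-01 in UTC. The function names are made up for this example. The number is the same whichever type is declared; what differs is whether a writer labels it as a plain int or as a date (Avro models a date as an int with a date logical type). The sketch does not claim which implementation writes which form.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Days from 1970-01-01 for a timezone-aware timestamp, computed in UTC."""
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


def as_date(days: int) -> date:
    """Interpret a day ordinal as a calendar date."""
    return date.fromordinal(EPOCH.toordinal() + days)


# Two ways a partition field holding the same ordinal could be declared
# in an Avro schema. Both carry an int on the wire.
AS_PLAIN_INT = {"type": "int"}
AS_DATE = {"type": "int", "logicalType": "date"}

ts = datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc)
days = day_transform(ts)
print(days, as_date(days))
```

A reader that expects one declaration and finds the other can reject the file even though the stored value is identical, which is the kind of mismatch a spec clarification is meant to rule out.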
The companion discussion on the Avro schema ambiguity for day transform fields in manifests https://lists.apache.org/thread/wx40fxmplrlsmwhyn8dqohm0ppnshgp1 shows why this dry-sounding clarification matters: Tserakhau noted the ambiguity bit someone again just last week on the Go side, where compacting a Spark-written table produced incompatible manifests. Kevin Liu suggested keeping the spec explanation format agnostic, and the fix landed in PR review. Small spec ambiguities compound into real interoperability bugs once five language implementations write the same metadata. The function catalog work crossed a milestone when huaxin gao's vote on REST spec endpoints for listing and loading functions https://lists.apache.org/thread/ttmmgzqhsdqt35stgtvzjmfgl42hgvw2 passed with ten +1 votes, five of them binding. Szehon Ho used his +1 to suggest tracking a specific-name for convenience over definition-id so engines can refer to each overloaded version of a function. With the spec change merging, Iceberg moves closer to catalogs that serve shared function definitions to every connected engine, which matters enormously for teams tired of reimplementing the same UDFs in Spark, Trino, and Flink. The variant data type push kept its momentum through two threads and a sync. Neelesh Salian posted the variant tracking document and sync notes https://lists.apache.org/thread/b2krpxxdqomlb19ffchmqwlsv8rhf59h , and the follow-up discussion on variant shredding policy across Iceberg implementations https://lists.apache.org/thread/sgo7g0voc2ctl1sr2fpf4qln5wmwlwwq tackled a subtle problem: aligning not just on the type definition but on how implementations shred variant values into columnar storage. Kurtis Wright praised the community for aligning on implementations rather than stopping at types. 
Shredding policy differences between engines would produce files that are technically spec compliant but perform wildly differently depending on which engine wrote them, so this alignment work protects the performance portability that makes Iceberg valuable. Performance optimization proposals arrived from Varun Lakhyani, who opened two related threads on cutting S3 request counts. His proposal to combine three GET calls for Parquet reads https://lists.apache.org/thread/yb8nom3w2zplb703m0p052kcc1wwotrr targets small file workloads where Iceberg currently issues two GETs for the footer and one for data when a single GET could fetch the whole file. The companion idea to store Parquet footer size in Iceberg metadata https://lists.apache.org/thread/csvfnhqgcpdbogb9yo29pdhdkbzdrrlq would let readers skip footer discovery entirely. For workloads on object storage where request costs and latency dominate, a two-thirds reduction in GET calls for small files is real money. Looking further ahead, Daniel Weeks proposed default value expressions for the v4 spec https://lists.apache.org/thread/w0xqrm0dpnsgvw0dyvy4r34y0dtzmn7f , building on the earlier expressions proposal to let defaults be computed rather than constant. Xiening Dai and Maninder Parmar continued working through global snapshot consistency for Iceberg tables https://lists.apache.org/thread/08nykzs7b9bdp1lvy0qnzglmbg1b254d , comparing a commit sequence number approach against a batch LoadTables API and concluding the two are complementary rather than contradictory. Mukund Thakur asked for review on his proposal for repartitioning old partition spec data files https://lists.apache.org/thread/4h6g5r633r65x5k92vqsn9ho0bhnry36 , which has been waiting since mid-May. Robert Kruszewski noticed that Iceberg's arrow-java dependency is more than two years old https://lists.apache.org/thread/9wp0xrr8jl6f615o335oooh9mjzxt2z5 at 15.0.2 and offered to drive the upgrade to 19.0.0. 
And Joana Hrotkó proposed exposing the commit retry exhaustion reason in failure messages https://lists.apache.org/thread/25zccjjpmrkx6pp350s64gvvvlx1lg18 , a small operability win for anyone who has stared at an opaque commit failure at 2 AM. Community infrastructure had a moment too. Bob Thomson from ASF Infra reported that Iceberg is the top consumer of shared GitHub-hosted runners https://lists.apache.org/thread/9s207npdlb76n458h209dgbgmfcttjz8 over the last seven days, with overall utilization maxing out daily. The timing was good, because Vova Kolmakov had already proposed running JDK 21 tests only on main and nightly builds https://lists.apache.org/thread/f9xhm6mwyspt15j06v14bkjjb4hts4yz to halve PR runner minutes, and Ajantha Bhat pointed to his open PR doing exactly that plus incremental CI builds, which has been waiting for review. On the events side, the Iceberg Summit 2027 location discussion https://lists.apache.org/thread/ngfrz7cdqpn2h97jm1zpfjctvclc3xzq turned into a friendly bidding war, with Viktor Kessler pitching Barcelona, Paris, and Berlin under the banner of making Iceberg global, while Danica Fine reminded everyone that Lakehouse Day EU in Glasgow https://lists.apache.org/thread/gc0wbgh7q8yh4hf1ctz7rfmqnyssg2th this October already gives the EMEA community a major gathering, co-located with Community Over Code and with its agenda now live. Kessler also announced the Iceberg Community Meetup Europe in Munich on July 22 https://lists.apache.org/thread/7sloq2kbmsvnwb7915dycpy9yb8s0cwy . Alex Stephen shared a healthy Iceberg Terraform Provider update https://lists.apache.org/thread/8xckb3h2rr421yswd2x53yb2zds8vmks with namespace and table management now supported, and huaxin gao posted notes from both the constraint support sync https://lists.apache.org/thread/yyx9x83s0dngf9py3lqvvxo07w10tw1k and the index support sync https://lists.apache.org/thread/b4rt9n4t703bps9qc8xo6tk9g3cx92k1 series. 
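
One technical footnote on the S3 request proposals from Varun Lakhyani covered above. A Parquet file ends with its footer metadata, then a 4-byte little-endian footer length, then the 4-byte magic PAR1. A reader that knows nothing about the file therefore has to fetch the tail before it can locate the footer. The sketch below shows that arithmetic, and how a footer size already stored in table metadata would make the tail read unnecessary. The function names and the idea of a `footer_len` value passed in from metadata are assumptions for illustration, not the proposal's actual field names.

```python
import struct

MAGIC = b"PAR1"
TAIL_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_range_from_tail(file_size: int, tail: bytes) -> tuple:
    """Return (start, end) offsets of the footer metadata, end exclusive.

    `tail` must be the last 8 bytes of the file, which costs one ranged read.
    """
    if len(tail) != TAIL_LEN or tail[4:] != MAGIC:
        raise ValueError("not a Parquet tail")
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = file_size - TAIL_LEN
    start = end - footer_len
    if start < len(MAGIC):
        raise ValueError("footer length exceeds file size")
    return start, end


def footer_range_from_metadata(file_size: int, footer_len: int) -> tuple:
    """Same range with no tail read, if the footer size is already known."""
    end = file_size - TAIL_LEN
    return end - footer_len, end


# Synthetic stand-in for a file: magic, 100 data bytes, 20 footer bytes, tail.
fake = MAGIC + b"d" * 100 + b"f" * 20 + struct.pack("<I", 20) + MAGIC
start, end = footer_range_from_tail(len(fake), fake[-TAIL_LEN:])
print(start, end, fake[start:end] == b"f" * 20)
```

For a small file the same reasoning supports the single-GET idea: if the whole object is about the size of one sensible read, fetching it once yields the tail, the footer, and the data together.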
Polaris generated 114 emails across 23 threads this week, and the volume tells you something: this project is in the thick of working out what a production catalog service owes its operators. The biggest thread by message count was, surprisingly, about the merge button. Jean-Baptiste Onofré opened a PR to enable all three GitHub merge actions https://lists.apache.org/thread/92x2yz3ckjx31kfz77js90wyhsoxxq86 , adding merge commits and rebase-and-merge alongside the existing squash-and-merge, and the thread ran to 23 messages. Yong Zheng merged it before seeing the discussion, offered to revert, and JB waved it off with characteristic calm. The substantive objection came from Alexandre Dutra, who sees some value in rebase-and-merge when used wisely but struggles to imagine a useful case for merge commits, and worries about what happens when someone uses the wrong button on a messy branch. Twenty-three messages about merge strategies sounds like bikeshedding until you remember that commit history is how a project audits itself, and Polaris contributors clearly care about getting their development hygiene right while the project is still young enough to set habits. The week's best protocol discussion came from Nándor Kollár, who asked the community to settle the correct HTTP status code for table and view rename conflicts https://lists.apache.org/thread/tr8zh8121t2jb41s0q2yd9s73y2tp2tq when a conflicting operation is in progress. The current behavior returns a 500, which Dmitri Bourlatchkov reviewed and declared most certainly not correct, since 5xx codes signal fundamental service failure beyond the client's control. The candidates each have problems: 503 implies the whole service is unhealthy, 429 means rate limiting and is not defined for rename in the Iceberg REST spec, and 409 traditionally signals a conflict the client should not blindly retry. Seventeen messages in, the thread had become a genuinely useful seminar on REST semantics for catalog operations. 
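
The practical consequence of the status code choice shows up in client retry loops. The following is a minimal sketch of how a generic client might react to the candidate codes as the thread describes them. It is not the behavior of any existing Iceberg or Polaris client, and `send` is a hypothetical callable standing in for the rename request.

```python
import time

# Codes a generic client would typically retry after waiting.
# 409 (conflict) and 500 (server fault) are deliberately absent: the thread
# describes 409 as something clients should not blindly retry.
RETRY_WITH_BACKOFF = {429, 503}


def rename_with_retry(send, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call send() until it succeeds, fails permanently, or attempts run out.

    `send` is a zero-argument callable returning an HTTP status code.
    Returns (final_status, attempts_used).
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if 200 <= status < 300:
            return status, attempt
        if status not in RETRY_WITH_BACKOFF:
            return status, attempt  # surface the error to the caller
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return status, max_attempts


def scripted(statuses):
    """Test helper: a fake request that returns the given statuses in order."""
    it = iter(statuses)
    return lambda: next(it)


no_sleep = lambda seconds: None
print(rename_with_retry(scripted([503, 503, 204]), sleep=no_sleep))
print(rename_with_retry(scripted([409, 204]), sleep=no_sleep))
```

Under these assumptions a 503 leads to a quiet retry that eventually succeeds, while a 409 stops immediately and hands the decision back to the caller, even though the underlying cause (a concurrent rename in progress) is the same.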
The resolution matters beyond Polaris, because whatever convention Polaris adopts will influence how clients across the ecosystem implement retry logic for concurrent catalog operations. Operational maturity drove a cluster of related threads on events and metrics. Yong Zheng raised the need for a mechanism to purge the events and metrics tables https://lists.apache.org/thread/5nst0f2ygnl2gj3j910q7m8nk2fvokc7 , since Polaris now persists both event streams and Iceberg metrics with no retention story. Kollár noted the urgency grows as event persistence expands to more event types, and Bourlatchkov suggested the Admin tool as the natural home, similar to the existing NoSQL maintenance task. Zheng followed with a proposal for filters on Iceberg metrics reporting https://lists.apache.org/thread/ogskc1szctkg5n0tdj0cm3pfkowcwx4z , sketching expressions that match on catalog, namespace, and table name. Bourlatchkov floated CEL as the filter language before recalling that prior community consensus leaned toward removing CEL, leaving include and exclude lists with glob patterns as the likely landing spot. The largest design question in this cluster came from Yufei Gu, who proposed routing Iceberg scan and commit metrics through the events subsystem https://lists.apache.org/thread/x9j8nscvy8hq61tyn01mj8yp6n9of0kp rather than maintaining a parallel persistence path, since synchronous metrics persistence chokes the Polaris persistence layer. Anand Kumar Sankaran noted with a smile that his original metrics PR proposed exactly this before the community decided to keep them separate, and flagged that any change here is a breaking schema migration. Dutra found the events approach appealing but wants performance overhead evaluated thoroughly first. 
That events subsystem got its own scrutiny in Dutra's thread on event delivery ordering and concurrency guarantees https://lists.apache.org/thread/yhs40z7r90mdpqbfzpwhqgxdrd8pln96 , prompted by a PR that shifted delivery to a blocking executor. The previous behavior implicitly relied on Vert.x event bus semantics that nobody had written down. Kollár argued listeners should be documented as thread-safe and that strict ordering rarely matters as long as every event arrives, and Gu took the pragmatic position: keep ordered delivery as the only behavior now, and introduce unordered delivery only if a real need appears. Documenting implicit guarantees before users depend on them accidentally is exactly the kind of unglamorous work that separates production infrastructure from promising prototypes. JB's Polaris Directories proposal https://lists.apache.org/thread/vr3tbs2ggp5fn5qtcz6br4srgvsoknrv advanced after several months of design work, and the discussion sharpened around one architectural question: where does the scanner live? Gu argued that if the scanning component sits completely outside Polaris, the user experience becomes confusing, with Polaris storing only directory configuration while real work happens elsewhere. JB clarified his two-step plan, landing configuration and high-level architecture first, then building the scanning service as part of Polaris proper. Romain Manni-Bucau pushed on extensibility, asking whether users can plug in their own metadata and whether scanning will be streaming friendly rather than batch only. Directories would give Polaris a way to govern data that has not yet been formalized into Iceberg tables, which extends the catalog's reach into the messy reality of most data lakes. Release machinery is turning for Apache Polaris 1.6.0, targeted around June 26 https://lists.apache.org/thread/1kmf1bqp0js8wjqj7pzr8y3z66ff0sss . 
EJ Wang reported no must-have blockers and plans to cut from main, while Adnan Hemani asked to land one PR first, a fix for a documentation versioning issue that had gone unreported for a while. JB updated the release process documentation to match. In parallel, the project took a step toward friendlier adoption when Yong Zheng proposed promoting the polaris CLI from PyPI https://lists.apache.org/thread/gf1zxlnyflqbwnrrx4jbbffnjtd0ngdb as the recommended setup for non-development use, sparing users a full repository clone. Gu, JB, and Hemani all backed it immediately. Two storage-layer threads rounded out the design work. Gu's proposal for making unique table locations the default https://lists.apache.org/thread/wnssxy75j5fb4ytpsfy5z55fvzx3yg3q won quick support from Russell Spitzer, who endorsed taking determinism out of table creation paths as a safety improvement. Bourlatchkov raised an important operational catch: with randomized locations, long-running staged create operations like CTAS face a credential refresh problem, connecting to the credential refresh discussion https://lists.apache.org/thread/ypdotvvvnndrhm7hv5cps37w4dphl8j6 Gu had flagged earlier in the week and to active design work on the Iceberg side. Bourlatchkov also recapped community sync consensus on supporting multiple storage configurations per catalog https://lists.apache.org/thread/7g400hw4rhfzz4f5wdslrqd6ft02jd2g , with authorization aspects deferred. And the Iceberg table encryption discussion https://lists.apache.org/thread/z27s3rxbkbz706c7qo736ojlf3kjv3mq continued between Gu and Bourlatchkov, working through whether Polaris can realistically test against encrypted Iceberg tables today. The answer is yes with caveats, and the work proceeds incrementally starting with internal Polaris workflows that touch encrypted files. Testing infrastructure produced this week's most quietly notable line. 
In the object storage mock testing thread https://lists.apache.org/thread/19zk75fo5vh71k227fbsyrcxgthnn2hm , Russell Spitzer shared a proof of concept he implemented with Claude's help, comparing approaches for testing file operations without real cloud containers. Robert Stupp agreed the POC clarifies the layering problem and they converged on a split: synthetic FileIO for generated listings and pure file operation behavior, real containers where fidelity matters. Bourlatchkov also opened threads on retiring the regtests code https://lists.apache.org/thread/5gjfrwlztz5c75pk586gwtnq41lydhnq in favor of Yong's new Spark smoke tests, fixing a Principal Role validation regex https://lists.apache.org/thread/9jwckjn6obxl8fb6dlj18y15ckxop3t4 through a REST spec change, and a subtle JSONB reformatting issue in PostgreSQL persistence https://lists.apache.org/thread/0vwl1w207n6vpkm8pgjv4vbpg0307g91 that argues for semantic JSON comparison in entity tests. The lineage conversation kept building. Adnan Hemani and Robert Stupp continued their OpenLineage follow-up https://lists.apache.org/thread/yxon21n43vofrnzxyh42yyh339c1nnw7 by working through what Polaris should do when lineage events reference non-Polaris datasets on both ends, with Stupp calling for broader community input because the options on the table represent materially different commitments. And Sankaran proposed a GCP counterpart to AWS STS session tags https://lists.apache.org/thread/yq1sz8y0nkfhloycw9lrqtc9k084ln2f so Polaris can correlate vended-credential data access back to the catalog operation that issued the credential on Google Cloud, closing an auditability gap between cloud providers. Taken together, the week's Polaris threads sketch the profile of a catalog growing into production responsibilities. Almost nothing this week was about new catalog features in the demo sense. 
Instead the community worked on retention for its own telemetry, correct HTTP semantics under concurrency, documented threading guarantees, credential lifecycle edge cases in staged writes, audit correlation across clouds, and test infrastructure that does not require a cloud bill. This is the unglamorous middle phase of an infrastructure project's life, after the architecture is proven and before the enterprise checklists are fully satisfied, and how a community handles this phase predicts whether operators will trust it with their metadata five years from now. The Polaris regulars, a group of roughly fourteen people this week, are handling it with notable discipline, and the 1.6.0 release later this month will carry the early fruits of that work.

Arrow had a steadier week at 24 emails across 11 threads, anchored by a release and a governance decision about AI tooling. Andrew Lamb shepherded Apache Arrow Rust 59.0.0 through its RC2 vote https://lists.apache.org/thread/xlozjylbqfo7tgh2lcvb6d3dvj5bwwxd after RC1 hit a verification problem that Ed Seidl fixed. Verification reports came in from Seidl on RHEL 8, Raúl Cumplido on Debian 14 with Rust 1.96, Adam Reeve on Fedora 44, and L. C. Hsieh, and Lamb announced the result https://lists.apache.org/thread/zmyp2zf4g3snxsc6nl977y6fm4g39stk with five +1 votes, four binding, publishing to crates.io. The arrow-rs release train remains one of the most reliable in the ecosystem, which matters because half the Rust data infrastructure world, DataFusion included, builds directly on it.

The discussion on automatic GitHub Copilot reviews https://lists.apache.org/thread/y7yc4yg9n4mdqd1y00w7s498y8m6yold produced one of the more thoughtful AI governance conversations in the ASF right now. After two weeks of testing, Cumplido found the reviews useful for ready PRs but wants them disabled for drafts, since a draft signals work in progress and an immediate bot review adds noise.
Lamb agreed they help as an initial pass and pushed for documenting what contributors are expected to do with bot feedback. Sutou Kouhei synthesized the feedback into a PR with a pragmatic split: first-time contributors get one policy, returning contributors another. Alenka Frim asked the practical question nobody had answered, which is when Copilot actually considers itself satisfied with a PR, since nobody had seen it grant an approval. Arrow is writing down norms for AI participation in code review while most projects are still improvising, and other communities will likely copy this homework. The format itself saw movement on two fronts. The arrow.range canonical extension type discussion https://lists.apache.org/thread/ofnxc1jsymppshbhrtqxtos9dw00wo3y wrestled with naming and semantics for bounded ranges, with Felipe Oliveira Carvalho proposing distinct types per boundary closedness, half-open, closed, and the variations between, rather than a single parameterized type. And the variant type support thread https://lists.apache.org/thread/b9ydqw5bm14htozzn1mxfr240bl2dn0s surfaced a coordination problem: Gang Wu pointed out that several duplicate efforts are underway on variant support in Arrow C++, including work by his colleague Zehua that iceberg-cpp already depends on. Micah Kornfield confirmed community interest and pointed to the freshly opened tracking issue. Duplicate implementations of the same type are wasted effort the dev list exists to prevent, so expect consolidation here. The Arrow family also grew. Following the donation vote, Benjamin Philip transferred the Arrow Erlang repository https://lists.apache.org/thread/6ww38cgnyq3ly176nrg1wy1o2zwsjnv1 to the ASF, and Kouhei confirmed it now lives at apache/arrow-erlang with repository setup landing next week. 
Flight SQL picked up two small protocol wins, with Pedro Matias closing the vote on the is_update field https://lists.apache.org/thread/tkpk2c04f7gc73rdo1wmr48mcn8l0x0s for prepared statement results with four binding +1s and work proceeding on Go, Java, ADBC, and JDBC implementations, while Richie Black's COLUMN_DEF addition to Flight SQL JDBC schema metadata https://lists.apache.org/thread/6fjb9dp3j7q3cw0l975bog2n5t7zd82c moved through its own vote. And in a thread that touches Arrow's measurement culture, Rok Mihevc and Jonathan Keane discussed the status of conbench https://lists.apache.org/thread/n6hxqojh510b4sgf0ojbmbt98kx82vyo , Arrow's continuous benchmarking project, with Mihevc interested in having his agents work on it and Keane happy to see anyone pick it up. The phrase "having your agents work on it" passing without comment in an ASF dev thread says plenty about where 2026 is.

Arrow's quieter week should not be mistaken for a quiet project. The format has reached the stage where its biggest contributions happen downstream, in arrow-rs powering DataFusion and a growing share of the Rust analytics ecosystem, in ADBC and Flight SQL steadily replacing bespoke wire protocols, and in the C++ library serving as the substrate for iceberg-cpp and the engines built on it. That last dependency is why the variant duplication issue deserves a faster resolution than it might otherwise get. With Iceberg, Parquet, and Spark all converging on variant as the standard answer for semi-structured data, Arrow C++ sits in the critical path for every engine that wants to read shredded variant columns efficiently, and two parallel implementations means review attention split exactly where the ecosystem can least afford it. Wu naming the problem publicly, with a disclaimer about his colleague's involvement, is the dev list doing its job.

Parquet packed 72 emails into just 7 threads, and one of them was the week's heavyweight across the entire lakehouse ecosystem.
The Future of Parquet Versioning discussion https://lists.apache.org/thread/5nx8r1y2qyotvg9ov5pl99dl498twt7m ran to 40 messages and pulled in nearly everyone who matters to the format: Ed Seidl, Andrew Lamb, Antoine Pitrou, Micah Kornfield, Daniel Weeks, Russell Spitzer, Ryan Blue, Fokko Driesprong, and Andrew Bell. The thread got off to an inauspicious start when the Google Doc anchoring the discussion started throwing terms of service violations for Seidl, Lamb, and others, an ironic argument for keeping foundational decisions in plain text on the mailing list. The substance is the question Parquet has deferred for a decade: what does a version number actually promise? Bell asked the question every practitioner asks, which is how a reader knows it has the tooling to read a given file, and what the hesitation is to simply bump version numbers. Seidl's answer exposed the uncomfortable status quo: today there is no in-use mechanism beyond parsing the created_by string, which means readers infer capabilities from writer name-dropping. The debate continues over whether Parquet should adopt feature flags, real version increments, or some hybrid, and the outcome will define how the format evolves for its second decade.

The reason this debate is happening now, rather than five years ago, is that Parquet's roadmap has filled up with changes that strain the old informal model. Variant types, geometry types, new statistics, the footer redesign, and dense encodings are all arriving in a short window, and each one forces the same question of how a reader discovers it can safely consume a file. The created_by approach worked when two or three writers dominated and everyone could memorize each other's quirks. With a dozen serious implementations across Java, C++, Rust, Go, and Python, capability discovery by string parsing is a correctness bug waiting to happen at every reader-writer pairing.
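
A short sketch shows why inference from a writer string is fragile. The writer names, version thresholds, and feature names below are all made up for illustration and are not real compatibility data. The only thing taken from practice is the common "name version x.y.z" shape of a writer string.

```python
import re

# Matches strings shaped like "some-writer version 1.2.3 (build ...)".
CREATED_BY = re.compile(
    r"^(?P<app>\S+) version (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
)

# Made-up thresholds for an imaginary feature. Every reader has to maintain
# a table like this, and it is wrong the moment a new writer appears.
HYPOTHETICAL_FEATURE_SINCE = {
    "writer-a": (2, 0, 0),
    "writer-b": (1, 4, 0),
}


def supports_feature_by_string(created_by):
    """Return True or False for known writers, None when the reader cannot tell."""
    match = CREATED_BY.match(created_by or "")
    if not match:
        return None
    since = HYPOTHETICAL_FEATURE_SINCE.get(match["app"])
    if since is None:
        return None  # unknown writer: the reader has to guess
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return version >= since


def supports_feature_by_flag(declared_features, feature):
    """The alternative: the file states what it uses, no writer table needed."""
    return feature in declared_features


print(supports_feature_by_string("writer-a version 2.1.0 (build 123)"))
print(supports_feature_by_string("writer-c version 9.0.0"))
print(supports_feature_by_flag({"imaginary-feature"}, "imaginary-feature"))
```

The unknown-writer case returning None is the heart of the problem: the string approach scales with the number of reader-writer pairs, while an explicit declaration in the file, whether a flag or a version, scales with the number of features.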
The versioning thread is really an interoperability thread wearing a version number costume, and the contributors arguing in it know that whatever mechanism wins must serve files that will still be read decades from now. Formats outlive engines, and they outlive companies. That is precisely why 40 messages of careful argument is time well spent.

Lamb attacked the same problem from the documentation side. Convinced by recent discussions that the community must document what V1 and V2 actually mean, messy reality included, he spent several days producing a feature-by-version documentation page https://lists.apache.org/thread/0jwhc6bdwptlormb4xpk07hnzfyz4p6p . Pitrou pushed back with a precise objection: the page invents an a posteriori meaning for V1 and V2, and he questioned why parquet-format 2.0.0 deserves to be singled out as a meaningful boundary. Lamb conceded that earlier drafts did try to invent definitions and revised toward describing what shipped rather than what the labels should have meant. This exchange is the versioning debate in miniature. The community is discovering that before it can design future versioning, it has to agree on a truthful account of past versioning.

While the philosophy unfolded, the release train kept moving. Gang Wu confirmed in the 2.13.0 release discussion https://lists.apache.org/thread/n0949bqh4dgjhmqym9kkv5y277zk0n0y that making ColumnMetaData.path_in_schema optional needs more discussion and will not block the release, with Fokko Driesprong and Kornfield agreeing to proceed. The vote on Apache Parquet Format 2.13.0 RC0 https://lists.apache.org/thread/7kjqsz7n8cwqpgfo2h9c5q0csml77d86 collected binding +1s from Kornfield and others, with Seidl's vote carrying the best line of the week: we have waited long enough for usable float statistics. Sortable floating point statistics have been a known gap for years, and 2.13.0 finally closes it.

The footer redesign work formalized its process. 
Jiayi Wang scheduled session 2 of the Parquet Footer Working Group https://lists.apache.org/thread/vz2n5qkkl4godby448lznc36sv9jxhgj , moving to a biweekly cadence, and Pitrou immediately raised the governance flag: for a change as foundational as the footer, decisions cannot be made in sync calls and merely reported to the list afterward. Wang agreed without hesitation, committing that syncs will inform but the mailing list will decide. Given that the footer working group is rethinking how every Parquet reader on earth bootstraps file access, insisting on mailing list primacy is not process pedantry. It is how the ASF model protects a format that multiple competing vendors depend on.

Two type system proposals advanced. Burak Yavuz moved the new File logical type proposal https://lists.apache.org/thread/m5hvh3mdgjl4482ws09wfzosotf01kqq from design doc to pull requests against parquet-format and the reference implementation, after the Parquet sync aligned on keeping the field simple and minimalistic. Daniel Weeks followed up with additional context from the sync discussion. A File logical type gives engines a standard way to represent file references inside Parquet data, which matters for multimodal and document-heavy workloads where tables increasingly point at external binary content. And Divjot Arora closed the loop on the long-running INT96 statistics question https://lists.apache.org/thread/9zl109s34zzzhjlnvls4g8mobb2hydcy , announcing the community has settled on introducing a new ColumnOrder to signal statistics validity for INT96 columns. Seidl endorsed it immediately, noting a new ColumnOrder is far preferable to parsing created_by strings, and offered a Rust proof of concept once the format PR lands. Notice the pattern: two separate threads this week independently identified created_by string parsing as the anti-pattern to eliminate. 
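
For readers wondering why float statistics were a gap at all, the short Python sketch below shows the general problem and one well-known way around it. It is an illustration only, not a description of what parquet-format 2.13.0 specifies (the release thread is the authority on that). Ordinary float comparison treats NaN as unordered, so a naive min depends on input order, whereas a key that follows the IEEE 754 total order sorts every value, NaN and signed zeros included. The helper name total_order_key is made up for this sketch.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose ordering follows IEEE 754 totalOrder.

    The bits are reinterpreted as a signed 64-bit integer. For negative
    floats the lower 63 bits are flipped so larger magnitudes sort lower.
    """
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits


values = [math.nan, 1.5, -0.0, 0.0, -2.0]

# Naive comparison: every "<" against NaN is False, so the answer
# depends on where the NaN happens to sit in the input.
naive_forward = min(values)
naive_backward = min(reversed(values))
print(naive_forward, naive_backward)

# Total-order comparison: the same answer regardless of input order.
stable_forward = min(values, key=total_order_key)
stable_backward = min(reversed(values), key=total_order_key)
print(stable_forward, stable_backward)

print(sorted(values, key=total_order_key))
```

A min or max that changes with row order cannot safely be used to skip row groups, which is why statistics that sort consistently matter for predicate pushdown on float columns.
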
DataFusion makes its second appearance in this newsletter with a lighter week by volume, 13 emails across 6 threads, but the quality of its release process was on full display. The vote on Apache DataFusion 54.0.0 RC1 https://lists.apache.org/thread/lxr1tbtz329zz3lykjoxttl7ypch71sx featured the kind of drama that proves verification works. Matt Butrovich cast a -1 after Comet, the Spark accelerator built on DataFusion, showed large performance regressions on TPC-H and TPC-DS at scale factor 1000 that appeared related to Parquet metadata parsing. Andrew Lamb connected it to a similar report from Adam in the Vortex project tied to new metadata cache size limits. Butrovich investigated further, found Adam's issue went through the ListingTable API that Comet does not use, could not reproduce the regression in DataFusion alone, and retracted his -1 while deferring the Comet upgrade for more investigation. Lamb then announced that the release was approved https://lists.apache.org/thread/wgtdp9nrbh8p14clf08c5t9wj3q51ro4 with 11 +1 votes, 7 binding. A downstream consumer running thousand-scale-factor benchmarks against a release candidate and the project taking the result seriously is exactly how the Rust data stack has earned its reputation.

Lamb also submitted the ASF board report https://lists.apache.org/thread/09y1l7f10o22dx393ln7y9wnl48soblx after crowdsourcing input from the community, and opened the 2026 Q3-Q4 roadmap discussion https://lists.apache.org/thread/x2k66nv46289ofcnlntcrv0gy83w1g8g with a tracking ticket inviting the community to say where it wants the project to go. Recognition arrived from inside the foundation too, with Rich Bowen inviting the project to a PlusOne.apache.org interview https://lists.apache.org/thread/195rmvn2jzyclxsk5243gt5bs4xf1771 , citing the 54.0 release, the new Java bindings, and a remarkable growth trajectory. 
Meanwhile Bob Thomson's infra review brought good news on the resource front: DataFusion has dropped out of the top consumers https://lists.apache.org/thread/znby696kvqb31vbybdysko251mntqb4g of ASF shared GitHub runners after recent CI optimization work that Oleks V. helped drive, the same week Iceberg learned it now tops that list. One project's playbook is sitting right there for the other to borrow.

The week's loudest theme is that format governance is becoming as important as format features. Parquet's 40-message versioning debate, Pitrou's insistence that footer decisions happen on the list rather than in syncs, Iceberg's question about when a draft spec in git becomes a finalized spec, and even the Polaris merge button thread are all the same conversation: as these projects become load-bearing infrastructure for the industry, the process by which they change matters as much as the changes themselves. Two separate Parquet threads independently named created_by string parsing as the failure mode to engineer away, which is what happens when a format relies on convention where it needs specification. Iceberg's day transform clarification, prompted by a real interoperability bug between Spark-written and Go-compacted tables, is the same lesson at smaller scale.

The variant type is now a genuinely cross-project effort, and this week showed both its promise and its coordination cost. Iceberg contributors aligned on shredding policy across implementations, Arrow surfaced duplicate variant implementations in C++ that need consolidation, and iceberg-cpp already depends on one of them. Semi-structured data support is arriving across the whole stack at once, which is exactly why the alignment syncs Neelesh Salian is running matter. 
Metadata efficiency formed a third connective thread: Iceberg proposals to cut GET calls and store footer sizes, the Parquet footer working group rethinking file bootstrap, and a DataFusion release candidate nearly held up by metadata cache behavior all point at the same bottleneck. The data files are fast. The metadata round trips are the tax everyone is now optimizing.

Finally, AI is quietly becoming part of how these communities work. Arrow is writing policy for Copilot reviews, Russell Spitzer prototyped Polaris test infrastructure with Claude's help, Iceberg syncs circulate AI-assisted summaries, and Rok Mihevc casually offered his agents for conbench maintenance. None of this was framed as remarkable by the participants, which is the remarkable part.

For practitioners, the week distills into three watch items. First, if you operate REST catalogs or pin client versions in production, the v2 loadTable and capabilities outcome will eventually reach your upgrade planning, so the time to read that thread is before the vote rather than after. Second, the metadata efficiency work across Iceberg and Parquet signals that small-file performance on object storage is getting first-class attention at the format level, which may relieve pressure on some of the compaction gymnastics teams perform today, even though compaction remains essential for the foreseeable future. Third, the float statistics fix in parquet-format 2.13.0 and the INT96 ColumnOrder decision both close long-standing correctness gaps in predicate pushdown, and engines will pick these up over the coming release cycles, so expect quiet query performance improvements on float-heavy datasets without changing a line of your own code.

Watch for the Iceberg C++ 0.3.0 RC3 result and the outcome of the v2 loadTable capabilities debate, which will shape REST catalog evolution well beyond this release cycle. 
Polaris 1.6.0 branches around June 26, the Parquet footer working group reconvenes June 23 with its mailing-list-first commitment in place, and the parquet-format 2.13.0 vote should close with float statistics finally fixed. The Iceberg patch releases 1.11.1 and 1.10.3 should move to votes shortly, and the Parquet versioning thread shows no sign of slowing down. The Iceberg variant shredding alignment and the Arrow C++ variant consolidation are worth tracking as a pair, since the semi-structured data story only works if both layers land compatible implementations. On the community calendar, Munich hosts the Iceberg Europe meetup July 22, Lakehouse Day EU registration is open for Glasgow in October, and the Iceberg Summit 2027 location conversation is just getting started, with European cities making an energetic early case. If the past week is any guide, the next one will be busy.