{"slug": "using-apache-iceberg-with-python-and-mpp-query-engines", "title": "Using Apache Iceberg with Python and MPP Query Engines", "summary": "This article, part 12 of a 15-part Apache Iceberg Masterclass, explains how to access Iceberg data using Python libraries like PyIceberg, DuckDB, and Polars for local analysis, as well as MPP query engines like Dremio for production-scale workloads. It details how PyIceberg reads metadata directly for efficient data scanning and writing, while DuckDB and Polars offer fast, in-process analytical querying. The article concludes that while Python libraries are suitable for single-machine tasks, MPP engines are necessary for handling petabyte-scale datasets.", "body_md": "This is Part 12 of a 15-part [Apache Iceberg Masterclass](https://iceberglakehouse.com/posts/). [Part 11](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/) covered metadata tables. This article covers the two main ways to access Iceberg data: directly from Python libraries and through MPP (massively parallel processing) query engines.\n\n## Table of Contents\n\n1. [What Are Table Formats and Why Were They Needed?](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/)\n2. [The Metadata Structure of Current Table Formats](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/)\n3. [Performance and Apache Iceberg's Metadata](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/)\n4. [Technical Deep Dive on Partition Evolution](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/)\n5. [Technical Deep Dive on Hidden Partitioning](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/)\n6. [Writing to an Apache Iceberg Table](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/)\n7. [What Are Lakehouse Catalogs?](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/)\n8. [Embedded Catalogs: S3 Tables and MinIO AI Stor](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/)\n9. [How Iceberg Table Storage Degrades Over Time](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/)\n10. [Maintaining Apache Iceberg Tables](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/)\n11. [Apache Iceberg Metadata Tables](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/)\n12. [Using Iceberg with Python and MPP Engines](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/)\n13. [Streaming Data into Apache Iceberg Tables](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/)\n14. [Hands-On with Iceberg Using Dremio Cloud](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/)\n15. [Migrating to Apache Iceberg](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/)\n\n## The Python Ecosystem for Iceberg\n\n### PyIceberg: Native Python Access\n\nPyIceberg is the official Python library for Apache Iceberg. It reads Iceberg metadata directly and can scan data files without an external query engine.\n\n``` python\nfrom pyiceberg.catalog import load_catalog\n\n# Connect to a REST catalog\ncatalog = load_catalog(\"my_catalog\", **{\n    \"type\": \"rest\",\n    \"uri\": \"https://catalog.example.com\",\n})\n\n# Load and scan a table\ntable = catalog.load_table(\"analytics.orders\")\nscan = table.scan(row_filter=\"amount > 100\")\ndf = scan.to_pandas()\n```\n\nPyIceberg leverages Iceberg's [metadata-driven pruning](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/): the `row_filter` is pushed down to manifest evaluation, so only relevant data files are read. For reading subsets of large tables into Python for analysis or ML training, this is remarkably efficient.\n\nPyIceberg also supports writes (appending data from Arrow tables), schema evolution, and table management operations. 
It connects to any catalog that implements the REST protocol, including [Dremio Open Catalog](https://www.dremio.com/platform/open-catalog/).\n\n### DuckDB: SQL-Based Python Analysis\n\nDuckDB can read Iceberg tables through its Iceberg extension:\n\n``` python\nimport duckdb\n\nconn = duckdb.connect()\nconn.execute(\"INSTALL iceberg; LOAD iceberg;\")\n\ndf = conn.execute(\"\"\"\n    SELECT customer_id, SUM(amount) as total\n    FROM iceberg_scan('s3://warehouse/orders')\n    GROUP BY customer_id\n\"\"\").fetchdf()\n```\n\nDuckDB processes the query locally using its columnar execution engine, which is significantly faster than pandas for analytical queries. It supports Iceberg's partition pruning and column statistics for file skipping. DuckDB runs entirely in-process, so there is no separate server to manage. This makes it a strong choice for local analysis, CI/CD data validation, and notebooks where starting a Spark cluster would be overkill.\n\nDuckDB also supports reading Iceberg metadata tables, which means you can use it for [table health diagnostics](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/) without standing up a full query engine.\n\n### Polars: High-Performance DataFrames\n\nPolars can read Iceberg tables through its `scan_iceberg` method, providing lazy evaluation and parallel processing:\n\n``` python\nimport polars as pl\n\ndf = pl.scan_iceberg(\"s3://warehouse/orders\").filter(\n    pl.col(\"amount\") > 100\n).collect()\n```\n\nPolars uses a lazy evaluation model: the `scan_iceberg` call does not read data immediately. Instead, it builds an execution plan. When `collect()` is called, Polars optimizes the plan (predicate pushdown, column pruning, parallel reads) and executes it. 
For large Iceberg tables, Polars can scan data several times faster than pandas because it uses all available CPU cores and processes data in Apache Arrow columnar format.\n\n### Writing from Python\n\nPyIceberg supports writes through Apache Arrow tables:\n\n``` python\nimport pyarrow as pa\n\n# Create an Arrow table with new data\nnew_data = pa.table({\n    \"order_id\": [1001, 1002, 1003],\n    \"amount\": [150.00, 275.50, 89.99],\n    \"order_date\": [\"2024-03-15\", \"2024-03-15\", \"2024-03-16\"],\n})\n\n# Append to the Iceberg table\ntable.append(new_data)\n```\n\nThis creates a new Iceberg [commit](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/) with the data files, manifests, and metadata. PyIceberg handles the entire write lifecycle, including partition assignment based on the table's [partition spec](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/).\n\nFor bulk writes from Python, using PyIceberg with Arrow is often simpler than setting up Spark. However, PyIceberg runs on a single machine, so it is not suitable for writing terabyte-scale datasets. For that, use an MPP engine.\n\n## MPP Query Engines\n\nFor production workloads at scale, Python libraries running on a single machine are not sufficient. 
MPP engines distribute query execution across multiple nodes, handling petabyte-scale tables with sub-minute response times.\n\n### Dremio\n\n[Dremio](https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/) provides full Iceberg support with several unique capabilities: [query federation](https://www.dremio.com/platform/federation/) across Iceberg and non-Iceberg sources, [automatic table optimization](https://www.dremio.com/blog/table-optimization-in-dremio/) through Open Catalog, a [semantic layer](https://www.dremio.com/platform/semantic-layer/) for governed access, and [AI-powered analytics](https://www.dremio.com/platform/ai/) through its built-in agent and MCP server.\n\nFor Python users, Dremio exposes data through Apache Arrow Flight, a high-performance data transfer protocol. Arrow Flight sends data in columnar Arrow format directly to the client, avoiding the row-by-row serialization overhead of JDBC/ODBC. For large result sets, this can make it 10-100x faster than traditional database connectors:\n\n``` python\nfrom dremio_simple_query import DremioConnection\n\nconn = DremioConnection(\"https://your-dremio.cloud\", token=\"...\")\ndf = conn.query(\"SELECT * FROM analytics.orders WHERE amount > 100\")\n```\n\nThe result is a pandas DataFrame populated via Arrow Flight. Because the data stays columnar end-to-end (Iceberg Parquet to Dremio to Arrow Flight to pandas), there are no row-oriented format conversion bottlenecks.\n\nDremio also provides a [Columnar Cloud Cache](https://www.dremio.com/blog/dremios-columnar-cloud-cache-c3/) that stores frequently accessed data on local NVMe drives, making subsequent queries against the same Iceberg data dramatically faster without requiring reflections or materialized views.\n\n### Spark\n\nApache Spark is the most mature Iceberg engine for both reads and writes. 
It handles batch ETL, streaming ingestion ([Part 13](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/)), and all [maintenance operations](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/). Most Iceberg production pipelines use Spark for data ingestion because of its extensive connector ecosystem (Kafka, JDBC, file formats) and its ability to process large volumes across a distributed cluster.\n\nSpark supports all Iceberg operations: CREATE, INSERT, MERGE, DELETE, UPDATE, schema evolution, partition evolution, and every maintenance procedure (compaction, snapshot expiry, orphan cleanup).\n\n### Trino\n\nTrino (formerly PrestoSQL) is optimized for interactive, ad-hoc queries with low latency. It reads and writes Iceberg tables and supports the REST catalog protocol. Trino is popular for exploration and dashboarding workloads where sub-second response times matter and data is being read rather than written. Its architecture keeps no persistent state, making it easy to scale up and down based on query demand.\n\n### Other Engines\n\nSeveral other engines provide Iceberg support: AWS Athena (serverless, AWS-native), Snowflake (read-only for external Iceberg tables), StarRocks (sub-second analytics), and Doris (real-time analytics). The Iceberg community maintains a [compatibility matrix](https://iceberg.apache.org/multi-engine-support/) showing which engines support which operations.\n\n### Choosing the Right Approach\n\nThe key takeaway: Python libraries (PyIceberg, DuckDB, Polars) are best for local analysis and development. MPP engines (Dremio, Spark, Trino) are necessary for production-scale analytics. 
Many teams use both: PyIceberg for data science experimentation, and Dremio for production dashboards and governed access.\n\n[Part 13](https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/) covers how to stream data into Iceberg tables.\n\n### Books to Go Deeper\n\n- [Architecting the Apache Iceberg Lakehouse](https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/) by Alex Merced (Manning)\n- [Lakehouses with Apache Iceberg: Agentic Hands-on](https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/) by Alex Merced\n- [Constructing Context: Semantics, Agents, and Embeddings](https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/) by Alex Merced\n- [Apache Iceberg & Agentic AI: Connecting Structured Data](https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/) by Alex Merced\n- [Open Source Lakehouse: Architecting Analytical Systems](https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/) by Alex Merced", "url": "https://wpnews.pro/news/using-apache-iceberg-with-python-and-mpp-query-engines", "canonical_source": "https://dev.to/alexmercedcoder/using-apache-iceberg-with-python-and-mpp-query-engines-1d0", "published_at": "2026-05-22 16:35:48+00:00", "updated_at": "2026-05-22 17:07:03.468562+00:00", "lang": "en", "topics": ["data", "open-source", "developer-tools", "cloud-computing", "enterprise-software"], "entities": ["Apache Iceberg", "PyIceberg", "DuckDB", "Dremio Open Catalog", "Python", "REST", "Amazon S3", "Arrow"], "alternates": {"html": "https://wpnews.pro/news/using-apache-iceberg-with-python-and-mpp-query-engines", "markdown": "https://wpnews.pro/news/using-apache-iceberg-with-python-and-mpp-query-engines.md", "text": "https://wpnews.pro/news/using-apache-iceberg-with-python-and-mpp-query-engines.txt", "jsonld": "https://wpnews.pro/news/using-apache-iceberg-with-python-and-mpp-query-engines.jsonld"}}