{"slug": "approaches-to-streaming-data-into-apache-iceberg-tables", "title": "Approaches to Streaming Data into Apache Iceberg Tables", "summary": "This article, Part 13 of a 15-part Apache Iceberg Masterclass, outlines three primary methods for streaming data into Iceberg tables: Spark Structured Streaming, Apache Flink, and the Iceberg Sink Connector for Kafka Connect. It explains that while Iceberg was designed for batch analytics, these approaches bridge the gap by committing data at regular intervals, creating a trade-off between data freshness and the small file problem. The article details the latency, small file impact, and ideal use cases for each method, highlighting Flink's exactly-once semantics and CDC support versus the operational simplicity of the Kafka Connect sink.", "body_md": "This is Part 13 of a 15-part Apache Iceberg Masterclass. Part 12 covered Python and MPP engines. This article covers the three primary approaches to streaming data into Iceberg tables and the operational trade-offs each creates.\nIceberg was designed for batch analytics, but most production data arrives continuously. Streaming ingestion bridges this gap by committing data to Iceberg tables at regular intervals. The challenge is that frequent commits create the small file problem, and managing that trade-off between data freshness and table health is the central concern of streaming to Iceberg.\nSpark Structured Streaming processes data in micro-batches and commits to Iceberg at configurable intervals:\ndf = spark.readStream.format(\"kafka\") \\\n.option(\"subscribe\", \"events\") \\\n.load()\ndf.writeStream.format(\"iceberg\") \\\n.outputMode(\"append\") \\\n.option(\"checkpointLocation\", \"s3://checkpoint/events\") \\\n.trigger(processingTime=\"60 seconds\") \\\n.toTable(\"analytics.events\")\nEach trigger creates a new Iceberg commit with the accumulated data. 
A 60-second trigger produces 1,440 commits per day, each adding a small number of files.\nLatency: Seconds to minutes (configurable via trigger interval).\nSmall file impact: Moderate. Longer trigger intervals produce fewer, larger files.\nBest for: Teams already using Spark for batch processing who want to add near-real-time ingestion.\nFlink processes events continuously and commits to Iceberg at checkpoint intervals:\n-- Flink SQL\nINSERT INTO iceberg_catalog.analytics.events\nSELECT event_id, event_time, payload\nFROM kafka_source\nFlink's checkpointing mechanism determines commit frequency. A 30-second checkpoint interval produces commits every 30 seconds with whatever data has accumulated.\nExactly-once semantics: Flink's checkpoint mechanism provides exactly-once delivery guarantees to Iceberg. If a Flink job crashes, it recovers from its last checkpoint and replays any data that was not yet committed to Iceberg. This means no duplicate records and no data loss, which is critical for financial and transactional data pipelines.\nPartitioned writes: Flink can route events to partitions dynamically based on partition transforms. Combined with Iceberg's hidden partitioning, this means streaming data lands in the correct partition directory automatically without any special logic in the streaming application.\nUpserts and CDC: Flink supports changelog streams (insert, update, delete operations) and can write them to Iceberg as equality deletes and data files. This enables CDC (change data capture) patterns where a database's transaction log is streamed directly into an Iceberg table, maintaining a near-real-time copy.\nLatency: Seconds (tied to checkpoint interval).\nSmall file impact: High. 
Frequent checkpoints produce many small files.\nBest for: Teams needing the lowest-latency streaming with exactly-once semantics and CDC support.\nThe Iceberg Sink Connector reads directly from Kafka topics and writes to Iceberg tables:\n{\n\"name\": \"iceberg-sink\",\n\"config\": {\n\"connector.class\": \"org.apache.iceberg.connect.IcebergSinkConnector\",\n\"topics\": \"events\",\n\"iceberg.catalog.type\": \"rest\",\n\"iceberg.catalog.uri\": \"https://catalog.example.com\",\n\"iceberg.tables\": \"analytics.events\"\n}\n}\nLatency: Minutes (Kafka Connect batches records before committing).\nSmall file impact: Lower than Spark/Flink because commits are less frequent.\nBest for: Organizations with existing Kafka infrastructure that want a managed connector approach.\nApache Iceberg Sink Connector: The community-maintained Iceberg Sink Connector for Kafka Connect supports schema evolution from Kafka's Schema Registry, automatic table creation, and partition routing. It reads records from Kafka topics, buffers them in memory, and commits to Iceberg in configurable batch intervals.\nOperational simplicity: Kafka Connect is a managed framework. You deploy the connector configuration, and Kafka Connect handles scaling, offset management, and fault recovery. There is no custom application code to write or maintain. For organizations that already run Kafka Connect for other sinks (databases, search indexes), adding an Iceberg sink is straightforward.\nEvery streaming approach shares the same fundamental problem: frequent commits produce small files. The solution is to pair streaming ingestion with aggressive compaction.\nA typical production pattern:\nDremio's automatic table optimization handles this compaction automatically for tables managed by Open Catalog. AWS S3 Tables also provides built-in compaction for streaming workloads.\nThe key insight: you do not always need sub-second latency. Most dashboards refresh every 5-15 minutes. 
If your consumers can tolerate 5-minute data freshness, a 5-minute trigger interval produces 288 commits per day instead of the 2,880 produced by a 30-second interval. That is 90% fewer commits (80% fewer than the 1,440 from a 60-second trigger), with correspondingly fewer small files and much less compaction overhead.\nA production streaming-to-Iceberg pipeline typically includes four components:\nThe most common mistake in streaming Iceberg architectures is deploying the stream processor without the compaction service. Without compaction, query performance degrades within days. Always deploy both together.\nAfter deploying a streaming pipeline, monitor these metrics daily using metadata tables:\nA well-tuned streaming pipeline commits every 1-5 minutes, produces files of 32-128 MB per commit, and has compaction running every 30-60 minutes to consolidate the small files into 256 MB targets.\nPart 14 provides a hands-on walkthrough of Iceberg on Dremio Cloud.", "url": "https://wpnews.pro/news/approaches-to-streaming-data-into-apache-iceberg-tables", "canonical_source": "https://dev.to/alexmercedcoder/approaches-to-streaming-data-into-apache-iceberg-tables-27k5", "published_at": "2026-05-22 16:53:14+00:00", "updated_at": "2026-05-22 17:05:24.603219+00:00", "lang": "en", "topics": ["data", "open-source", "developer-tools", "cloud-computing", "enterprise-software"], "entities": ["Apache Iceberg", "Spark", "Flink", "Kafka", "AWS S3"], "alternates": {"html": "https://wpnews.pro/news/approaches-to-streaming-data-into-apache-iceberg-tables", "markdown": "https://wpnews.pro/news/approaches-to-streaming-data-into-apache-iceberg-tables.md", "text": "https://wpnews.pro/news/approaches-to-streaming-data-into-apache-iceberg-tables.txt", "jsonld": "https://wpnews.pro/news/approaches-to-streaming-data-into-apache-iceberg-tables.jsonld"}}