{"slug": "how-data-lake-table-storage-degrades-over-time", "title": "How Data Lake Table Storage Degrades Over Time", "summary": "This article, part 9 of a 15-part Apache Iceberg Masterclass, explains five ways Iceberg table storage degrades over time without maintenance, including the accumulation of small files, orphan files, and large metadata. It details how operations like appends and updates gradually worsen query performance and provides methods to detect issues like excessive snapshots and skewed partitions before they impact performance.", "body_md": "This is Part 9 of a 15-part Apache Iceberg Masterclass. Part 8 covered embedded catalogs. This article explains the five ways Iceberg table storage degrades and how to detect each problem before it impacts query performance.\nAn Iceberg table that works well on day one will not work well on day 365 without maintenance. Every append, update, and delete operation adds files and metadata. Without periodic cleanup and reorganization, query performance gradually deteriorates until someone notices that a dashboard that used to load in 2 seconds now takes 30.\nSmall-file accumulation is the most common and most impactful form of degradation. Streaming ingestion, micro-batch pipelines, and frequent INSERT operations each create new data files. If these operations produce many small files (under 32 MB), the table accumulates thousands of files where dozens would suffice.\nImpact: Each file becomes a manifest entry. A table with 10,000 small files has 10,000 entries that the query planner must evaluate, compared to 40 entries for the same data in properly sized 256 MB files. Planning time increases linearly with file count.\nCause: Frequent commits with small amounts of data. A streaming pipeline committing every 30 seconds might add 2-3 files per commit, producing 5,000+ files per day.\nOrphan files are data files that exist in storage but are not referenced by any current or retained snapshot. 
They accumulate from:\nImpact: Orphan files waste storage space and money. A heavily-written table can accumulate terabytes of orphan files over months. In one common scenario, a daily batch pipeline writing 50 GB per day with weekly compaction can produce 350 GB of orphan files every week. Without cleanup, this costs thousands of dollars annually in storage fees alone.\nEvery commit creates a new snapshot in metadata.json. Over time, the metadata file grows as the snapshot list lengthens. The manifest list for each snapshot may also reference many manifest files, especially if the table has been modified in many different partitions.\nImpact: The metadata.json file becomes large, taking longer to download from object storage. At 10,000+ snapshots, the metadata file itself can exceed 100 MB, adding seconds to every query's planning phase. The manifest list grows, making scan planning slower because there are more manifests to evaluate.\nHow to detect it: Check the snapshot count using metadata tables. If it exceeds 1,000, configure snapshot expiry to keep the count manageable.\nIf a table has a declared sort order (e.g., sorted by customer_id for efficient lookups), new data written by different engines or pipelines may not respect this sort order. Over time, the min/max statistics per file widen as new unsorted data is mixed with sorted data.\nImpact: File skipping becomes less effective. As described in Part 3, tight min/max ranges enable file pruning. Wide ranges mean few files can be skipped. A well-sorted table might skip 95% of files for a filtered query, while the same table with decayed sort order might skip only 10%.\nHow to fix it: Run compaction with sorting to rewrite files in the correct order and restore tight min/max ranges.\nSome partitions grow much larger than others. An event table partitioned by day(event_time) might have 10 GB on a normal day but 500 GB during a promotional event. 
The oversized partition contains files that are too large or too numerous for efficient processing.\nImpact: Queries against skewed partitions are slower because they must process disproportionately more data. Parallel execution becomes unbalanced when one partition's task takes 50x longer than the others.\nConsider a table receiving 100 small appends per day from a streaming pipeline:\nWithout compaction, the table becomes nearly unusable for interactive analytics within 6 months. With daily compaction, the same table stays at 40-50 well-sized files regardless of how many commits happen each day.\nIceberg provides metadata tables that let you inspect table health. Here are the key diagnostic queries:\n-- Average file size\nSELECT AVG(file_size_in_bytes) / 1024 / 1024 AS avg_mb\nFROM TABLE(table_files('analytics.orders'))\nIf average file size is below 32 MB, you have a small file problem. Target: 128-512 MB.\n-- How many snapshots exist?\nSELECT COUNT(*) AS snapshot_count\nFROM TABLE(table_snapshot('analytics.orders'))\nIf snapshot count exceeds 1,000, you should expire older snapshots.\n-- Files per partition\nSELECT partition, COUNT(*) AS file_count\nFROM TABLE(table_files('analytics.orders'))\nGROUP BY partition\nORDER BY file_count DESC\nPartitions with hundreds of files are candidates for compaction.\nDremio supports all Iceberg metadata table queries and provides a SQL interface for monitoring table health.\nEvery Iceberg table in production needs maintenance. The question is not whether to maintain tables but how: manually, through scheduled jobs, or through automated services. Part 10 covers all three approaches in detail.\nThe cost of not maintaining Iceberg tables is both direct (wasted storage from orphan files) and indirect (slow queries leading to poor user experience, excessive cloud compute costs from reading unnecessary data). 
Organizations with hundreds of Iceberg tables often find that a single data engineer dedicated to table maintenance saves more in compute and storage costs than their salary. Automated maintenance through Dremio or S3 Tables removes this operational burden entirely.", "url": "https://wpnews.pro/news/how-data-lake-table-storage-degrades-over-time", "canonical_source": "https://dev.to/alexmercedcoder/how-data-lake-table-storage-degrades-over-time-47i5", "published_at": "2026-05-22 15:11:54+00:00", "updated_at": "2026-05-22 15:39:43.104567+00:00", "lang": "en", "topics": ["data", "open-source", "developer-tools", "cloud-computing"], "entities": ["Apache Iceberg"], "alternates": {"html": "https://wpnews.pro/news/how-data-lake-table-storage-degrades-over-time", "markdown": "https://wpnews.pro/news/how-data-lake-table-storage-degrades-over-time.md", "text": "https://wpnews.pro/news/how-data-lake-table-storage-degrades-over-time.txt", "jsonld": "https://wpnews.pro/news/how-data-lake-table-storage-degrades-over-time.jsonld"}}