{"slug": "migrating-to-apache-iceberg-strategies-for-every-source-system", "title": "Migrating to Apache Iceberg: Strategies for Every Source System", "summary": "This article, the final part of a 15-part Apache Iceberg Masterclass, outlines three strategies for migrating existing data to Iceberg: in-place migration, full rewrite, and shadow migration. It explains that the recommended approach for production is the view swap pattern, which uses Dremio's semantic layer to create views that allow for a zero-downtime transition by first pointing consumers to legacy data, then swapping the view to the new Iceberg table after validation.", "body_md": "This is Part 15, the final article of a 15-part Apache Iceberg Masterclass. Part 14 covered hands-on Dremio Cloud. This article covers the three migration strategies and how to execute a zero-downtime migration using the view swap pattern.\nMost organizations do not start with Iceberg. They have years of data in Hive tables, data warehouses, CSV files, databases, and Parquet directories. Moving this data to Iceberg is not an all-or-nothing project. The best migrations happen incrementally, one dataset at a time, with no disruption to existing consumers.\nIn-place migration creates Iceberg metadata over existing Parquet or ORC files without copying or moving them. The data files stay exactly where they are; only new Iceberg metadata is created to track them.\nSpark example:\nCALL system.migrate('db.existing_hive_table')\nThis converts a Hive table to Iceberg by scanning its files and creating the Iceberg metadata tree (metadata.json, manifest list, manifest files) that references them. The Parquet files are untouched.\nPros: Fast. No data movement. The table becomes queryable as Iceberg immediately.\nCons: The existing file layout (sizes, partitioning, sort order) is inherited. If the original files are poorly organized, you inherit those problems. 
Requires the original files to be in Parquet or ORC format.\nA full rewrite reads data from any source and writes it as a new Iceberg table with optimal partitioning and file sizes:\n-- Spark\nCREATE TABLE iceberg_catalog.analytics.orders\nUSING iceberg\nPARTITIONED BY (day(order_date))\nAS SELECT * FROM hive_catalog.legacy.orders;\n-- Dremio\nCREATE TABLE analytics.orders\nPARTITION BY (day(order_date))\nAS SELECT * FROM legacy_source.public.orders;\nPros: Best result. You choose the file sizes, sort order, and partitioning, so the table is well organized from day one.\nCons: Requires reading and writing all data, which takes time and compute resources. The source system must be available during the migration.\nShadow migration builds the Iceberg table alongside the existing source, then swaps consumers from old to new when ready.\nPros: Zero downtime. Consumers never see a disruption. You can validate the migration before committing to it.\nCons: Temporarily doubles storage costs. Requires maintaining two copies during the transition.\nThe view swap pattern is the recommended approach for production migrations. It uses Dremio's semantic layer to create an abstraction between consumers and the underlying data.\nCreate views in Dremio that point to the legacy data source:\nCREATE VIEW analytics.orders AS\nSELECT order_id, customer_id, order_date, amount, status, region\nFROM postgres_source.public.orders\nAll consumers (dashboards, reports, notebooks) query through these views. 
They do not know or care where the data physically lives.\nCreate and populate the Iceberg table:\n-- Create the Iceberg table\nCREATE TABLE iceberg_data.analytics.orders (\norder_id BIGINT, customer_id BIGINT,\norder_date DATE, amount DECIMAL(10,2),\nstatus VARCHAR, region VARCHAR\n) PARTITION BY (day(order_date));\n-- Backfill from the legacy source, naming columns so the mapping does not depend on source column order\nINSERT INTO iceberg_data.analytics.orders\nSELECT order_id, customer_id, order_date, amount, status, region\nFROM postgres_source.public.orders\nCompare the two datasets to confirm data integrity:\nSELECT\n(SELECT COUNT(*) FROM postgres_source.public.orders) AS legacy_count,\n(SELECT COUNT(*) FROM iceberg_data.analytics.orders) AS iceberg_count\nBeyond row counts, validate aggregates (total amounts, distinct customer counts) and spot-check individual records. A comprehensive validation script should compare:\nOnly proceed to the swap after all validation checks pass.\nUpdate the view to point to the Iceberg table:\nCREATE OR REPLACE VIEW analytics.orders AS\nSELECT order_id, customer_id, order_date, amount, status, region\nFROM iceberg_data.analytics.orders\nConsumers notice nothing. The view name is the same. The query interface is the same. But now the data is served from Iceberg with all of its advantages: time travel, hidden partitioning, metadata-driven pruning, and automatic optimization.\nThe view swap pattern enables incremental migration. You do not need to migrate everything at once:\nDuring the transition, Dremio's query federation lets a single query span legacy and Iceberg tables. A join between a PostgreSQL table and an Iceberg table works the same as a join between two Iceberg tables. The migration is invisible to consumers.\nAfter migrating each table:\nMigrating without testing query performance: Always benchmark critical queries against the new Iceberg table before switching production traffic. 
Iceberg's partition layout and file organization affect performance, and a migration can make some queries faster but others slower if the partition strategy is wrong.\nSkipping the validation phase: Data discrepancies between the old and new systems are more common than expected. Schema differences, timezone handling, null semantics, and data type precision can all cause subtle mismatches. Validate thoroughly.\nMigrating everything at once: Large \"big bang\" migrations carry high risk. If something goes wrong, rolling back is complex and time-consuming. Migrate one table at a time, validate each one, and build confidence incrementally.\nThis completes the Apache Iceberg Masterclass. The series covered table formats, metadata, performance, partitioning, writes, catalogs, maintenance, tooling, and migration. For hands-on practice, start a Dremio Cloud trial and follow the workflow in Part 14.", "url": "https://wpnews.pro/news/migrating-to-apache-iceberg-strategies-for-every-source-system", "canonical_source": "https://dev.to/alexmercedcoder/migrating-to-apache-iceberg-strategies-for-every-source-system-424j", "published_at": "2026-05-22 17:48:08+00:00", "updated_at": "2026-05-22 18:05:12.794041+00:00", "lang": "en", "topics": ["data", "open-source", "developer-tools", "cloud-computing", "enterprise-software"], "entities": ["Apache Iceberg", "Dremio", "Spark", "Hive", "Parquet", "ORC"], "alternates": {"html": "https://wpnews.pro/news/migrating-to-apache-iceberg-strategies-for-every-source-system", "markdown": "https://wpnews.pro/news/migrating-to-apache-iceberg-strategies-for-every-source-system.md", "text": "https://wpnews.pro/news/migrating-to-apache-iceberg-strategies-for-every-source-system.txt", "jsonld": "https://wpnews.pro/news/migrating-to-apache-iceberg-strategies-for-every-source-system.jsonld"}}