How Did We Get Here? A Brief History

Relational Database Management Systems (RDBMS) have been a standard option for organizations for storing transactional data. RDBMSs used for storing transactional data support a specific data processing category called Online Transaction Processing (OLTP). Examples of OLTP-optimized RDBMSs are PostgreSQL, MySQL, and Microsoft SQL Server.

These systems are built for quick interactions with small amounts of data, usually just one or a few rows at a time. But they’re not great for analytical queries—trying to pull large-scale insights from them can cause serious performance slowdowns. Here’s why:

Transactional systems focus on inserting, updating, and retrieving small sets of rows. In contrast, analytics systems work by aggregating columns or scanning entire tables, which is where a columnar structure shines.
Running both transactional and analytical workloads on the same system can create resource conflicts, slowing everything down.
Transactional databases are optimized with normalized data spread across multiple related tables, making joins efficient when needed. On the other hand, analytics workloads often perform better with denormalized data in a single table, reducing the need for complex joins.

Systems that deals with complex aggregate queries are known as Online Analytical Processing (OLAP).

Foundational Components of a System Designed for OLAP Workloads

A system designed for OLAP workloads has the following structure:

Storage. It’s a systems that allows you to store large significant amounts of data. Some options are:

local filesystem on direct-attached storage (DAS)
a distributed filesystem on a set of nodes (such as Hadoop Distributed File System [HDFS])
an object storage provided as a service by cloud providers (such as Amazon S3).

File Format. For storage purposes, your raw data needs to be organized in a particular file format. The choice of file format impacts things such as the compression of the files, the data structure, and the performance of a given workload. File formats generally fall into three high-level categories:

structured (CSV)
semistructured (JSON)
unstructured (text files)

Within the structured and semi-structured categories, file formats can be:

Row-oriented: Stores all columns of a specific row together.
- Examples: CSV, Apache Avro

Column-oriented (columnar): Stores all rows of a specific column together.
- Examples: Apache Parquet, Apache ORC

Table format. It’s crucial for systems that handle analytical workloads and large volumes of data. It acts as a metadata layer above the file format, specifying how data should be stored. Its goal is to abstract the complexity of the physical data structure and support operations like inserts, updates, deletes, and schema changes. Modern table formats also ensure atomicity and consistency during these operations.

Storage engine. It’s à§Mresponsible for organizing and updating data according to the table format. It handles key tasks like data optimization, index maintenance, and the removal of outdated data.

Catalog. It helps users and compute engines quickly locate necessary datasets because, as data size grows, it is important to quickly identify the data you might need for your analysis. A catalog’s role is to tackle this problem by leveraging metadata to identify datasets. By using metadata, the catalog provides information like table names, schemas, and storage locations. Catalogs can be internal to systems (e.g., Postgres, Snowflake) or open for use by multiple systems (e.g., Hive, Project Nessie). It’s important to note that these metadata catalogs differ from those used for human data discovery, such as Colibra or Atlan.

Compute engine. It’s essential for processing large amounts of data stored in a system. Depending on the data volume and computational needs, multiple compute engines may be used. For massive datasets or heavy workloads, a distributed compute engine is necessary, utilizing massively parallel processing (MPP). Examples of MPP-based engines include Apache Spark, Snowflake, and Dremio.

Bringing It All Together

Traditionally, OLAP workloads have been managed by a data warehouse, where all the technical components (like storage and compute engines) are tightly integrated. Data warehouses allow organizations to store data from various sources and perform analytical tasks on top of the data itself. The next section will explore data warehouse capabilities, integration of components, and the pros and cons of using such a system.

The Data Warehouse

A data warehouse is a centralized repository that stores large volumes of data from various sources, such as operational systems, application databases, and logs. Here’s an architectural overview of a data warehouse:

It integrates all technical components into a single system, with data stored in proprietary file and table formats on its own storage system. The data is managed by the warehouse’s storage engine, registered in its catalog, and can only be accessed through its compute engine by users or analytical tools. That’s simply a limitation of Data Warehouses.

A brief History

Before 2015, traditional data warehouses were primarily designed to run on-premises, with storage and compute resources tightly coupled on the same hardware. This tight integration meant that both components scaled together, regardless of actual needs. As data volumes and processing workloads grew rapidly, this setup posed significant challenges. For example, if you needed more storage but not additional compute power, you were still forced to purchase both—leading to inefficiencies and unnecessary costs.

These limitations pushed the evolution toward a new generation of cloud-focused data warehouses. Around 2015, the rise of cloud-native architectures enabled a more flexible model: storage and compute could be scaled independently. This meant organizations could adjust each resource based on specific demands, leading to better performance and cost-efficiency. Moreover, cloud warehouses introduced the ability to turn off compute resources when not in use—without affecting stored data—further optimizing resource use and cost.

Pros and Cons of a Data Warehouse

Data warehouses, whether deployed on-premises or in the cloud, provide organizations with a centralized repository for storing historical data from a wide variety of sources. This centralized setup allows analysts, business intelligence (BI) engineers, and other data consumers to access and analyze data efficiently from a single, unified source. Thanks to powerful underlying technologies, data warehouses are well-equipped to handle large-scale BI workloads, supporting data-driven decision-making based on historical insights.

However, despite these benefits, data warehouses come with notable limitations:

Limited to Relational Workloads. Data warehouses are primarily optimized for structured, relational data and SQL-based analysis. They aren’t designed to support machine learning (ML) workloads natively. If an organization wants to do more advanced tasks like predictive modeling or forecasting, it has to export data to other platforms better suited for ML. This not only introduces redundant copies of data but also necessitates complex data pipelines, increasing the risk of data drift (inconsistencies over time) and model decay due to poorly synchronized data.
Lack of Support for Non-Structured Data. Data warehouses do not natively handle semistructured (like JSON) or unstructured data (like text or images)—types increasingly used in modern analytics, especially ML applications. For instance, understanding customer sentiment from booking reviews requires analyzing text data, something a warehouse can’t process directly.
Closed Architecture and Vendor Lock-In. Most data warehouses have a closed architecture, meaning the data is stored in proprietary table and file formats, tightly coupled to the warehouse’s compute engine. This makes data accessible only through that specific platform, creating vendor lock-in. Over time, as more data and workloads accumulate in the warehouse, migrating to another platform becomes increasingly difficult and costly, stifling flexibility in choosing tools that best serve evolving business needs.
High and Increasing Costs. Using a data warehouse entails significant operational costs—both for storage and especially for compute, which scales up as more workloads are introduced. Beyond infrastructure, there are human and organizational costs, such as the need for engineering teams to build and maintain pipelines, and delays in delivering insights to business users due to pipeline complexity or system limitations.

These issues—especially the inflexibility, rising costs, and lack of ML/data diversity support—have motivated many organizations to explore alternatives. In particular, there’s a growing interest in Data Lakes, which offer open file formats and more flexibility, allowing BI and ML workloads to coexist and operate independently, all while keeping costs more manageable and control in the hands of the organization.

Data Lake

Data lakes emerged as an alternative to traditional data warehouses, mainly to address their limitations. While data warehouses excel at processing structured data, they struggle in areas such as:

Storing semistructured or unstructured data (e.g., JSON, images, text).
High storage costs, especially compared to on-prem Hadoop clusters or modern cloud object storage.
Coupled storage and compute, particularly in on-premises environments, which forces organizations to pay for more compute power even if only storage needs increase.

To overcome these issues, data lakes were introduced as a more flexible and cost-effective solution that could store all types of data without requiring them to conform to a rigid schema upfront.

A brief History

Initially, data lakes were implemented using the Hadoop ecosystem to store and process large amounts of structured and unstructured datasets across clusters of inexpensive computers. Hadoop Ecosystem is a collection of open-source tools designed for distributed computing:

HDFS (Hadoop Distributed File System) handled large-scale storage across low-cost hardware clusters.
MapReduce allowed users to run analytics on that data, though it required writing complex and verbose Java code.
Since many users preferred SQL over Java, Hive was developed to bridge the gap—translating SQL into MapReduce jobs.

Hive also introduced the Hive table format, which interpreted a directory and the file inside it as a table. This innovation enabled SQL-based querying on top of a data lake, and it became a de facto standard for recognizing files within distributed file systems as singular tables.

As technology evolved, organizations began moving away from on-premises Hadoop setups to cloud object storage like Amazon S3, MinIO, or Azure Blob Storage. These platforms offered:

Lower costs and easier management.
Greater scalability and availability.

Meanwhile, MapReduce became obsolete, replaced by more powerful and user-friendly distributed query engines, such as Apache Spark, Presto, and Dremio.

Despite this shift, the Hive table format persisted, continuing to provide a method for interpreting directory-based storage as structured tables. However, it wasn’t designed for cloud environments, and its dependence on hierarchical folder structures led to inefficient network usage, especially in high-latency cloud storage settings.

One of the defining features of a data lake is the ability to use multiple compute engines for different types of workloads. This flexibility is crucial because no single compute engine can efficiently handle every kind of data processing. Each engine involves trade-offs, and being able to select the right one for a given task allows organizations to optimize performance and cost.

However, unlike a data warehouse, a data lake doesn’t usually have a dedicated storage engine that continuously manages and optimizes data. Instead:

The compute engine controls how data is written to storage.
Once written, data is often not re-optimized, unless entire tables or partitions are manually rewritten, which is typically done sporadically.

The following figure depicts how the components of a data lake interact with one another:

Pros and Cons of a Data Lake

Data lakes offer several distinct advantages that make them attractive for modern data architectures, especially when compared to traditional data warehouses:

Lower Cost. Data lakes are significantly more cost-effective for both storage and query execution. This makes them ideal for analyzing large volumes of data that might not be important enough to justify the high cost of data warehouse processing.
Use of Open File Formats. Unlike data warehouses, which often lock data into proprietary formats, data lakes allow storage in any format, including open formats like Parquet, ORC, Avro, or even plain text. This flexibility:
- Gives organizations greater control over their data.
- Enhances tool compatibility, since many analytics, machine learning, and processing tools are designed to work with these open standards.
Support for Unstructured Data. Data lakes can store and process not only structured data, but also unstructured and semistructured data.

Limitations:

Performance Limitations Because a data lake’s components (storage, file format, table format, compute engine) are loosely coupled, they lack many of the built-in optimizations found in traditional data warehouses. For example:
- No built-in indexing
- No inherent support for ACID transactions
To achieve warehouse-level performance, you would need to manually integrate and fine-tune all components, which demands significant engineering effort. This makes data lakes less suitable for low-latency or high-priority analytics.
Complex Configuration Requirements. Creating a well-optimized data lake often requires assembling and configuring multiple tools. Without a unified system, it’s up to the engineering team to orchestrate how data is stored, read, processed, and managed.
Lack of Built-in ACID Transactions. Traditional relational systems ensure ACID properties (Atomicity, Consistency, Isolation, Durability), but data lakes typically do not offer these guarantees natively. Instead, they ingest data using schema-on-read, where validation happens only during query time, not during ingestion. This approach introduces risks in applications requiring strict data integrity and consistency.

Should I Run Analytics on a Data Lake or a Data Warehouse?

Even though data lakes are good for storing all types of data—structured and unstructured—they still have some problems when it comes to running analytics. After landing data into the data lake using ETL (extract, transform, and load), there are usually two options for analytics:

Send a curated subset of the data to the data warehouse. This is for high-priority analytics, where you need better performance and the flexibility of the warehouse.
But this causes a few problems:
- You spend extra on compute for the new ETL pipeline and pay again for storing data in the warehouse, where storage is more expensive.
- You end up creating many copies of the same data—for example, one for each business unit (data marts), and more copies when analysts export parts of the data for BI dashboards. This leads to a messy situation where it’s hard to track, manage, and keep all copies in sync.
Run queries directly on the data lake using engines like Dremio, Presto, Spark, Trino, or Impala. These are generally good for read-only queries. But when it comes to updating data, they struggle because the Hive table format doesn’t handle updates well and adds complexity.

In the end, both data lakes and warehouses have their strengths and weaknesses. That’s why a new type of architecture was created—called the Data Lakehouse—to combine the best parts of both and reduce the downsides.

The Data Lakehouse

Data warehouses are great for performance and ease of use. On the other hand, data lakes give you lower cost, support for open formats, and the ability to work with unstructured data. The Data Lakehouse came from the need to combine the benefits of both.

A data lakehouse keeps the storage and compute separation of data lakes but adds features that make it behave more like a data warehouse. This includes ACID transactions, better performance, and more consistency. What makes this possible are new table formats that fix the problems of the older Hive format. With a data lakehouse, you still:

Store your data in the same places as a data lake,
Use the same compute engines,
Store data in the same file formats.

What changes everything is the table format, which adds a metadata layer between the compute engine and the storage. This makes their interaction smarter and brings new value:

Fewer copies = less drift. Because of better performance and ACID guarantees, you can now do updates and data manipulation right in the lakehouse—something you usually needed a warehouse for. This means fewer data copies, which reduces storage and compute costs, avoids data drift (differences between copies), and improves governance.
Faster queries = faster insights. Since the goal is to get business insights quickly (comparable to data warehouses), faster queries mean faster results. This is due to:
- the engine (with optimizers and caching),
- the table format (with metadata for better query planning),
- and the file format (using sorting and compression).
Historical data snapshots = fewer problems. Lakehouse table formats support snapshots of your data. That means you can go back to an earlier state if something goes wrong—no panic, no major rework.
Affordable architecture = better value. Lakehouses can help you lower costs (by avoiding duplication and extra ETL) while still helping you get insights to increase revenue. They use cheaper storage and compute compared to traditional warehouses.
Open architecture = no vendor lock-in. Data lakehouses use open formats like Apache Iceberg (table format) and Apache Parquet (file format). Many tools can use these formats, so you avoid being stuck with one vendor or tool, giving you flexibility and peace of mind.

Here’s the technical components of a data lakehouse:

In short, modern data lakehouses give you the best of all worlds by working directly on the data lake. What makes this possible is the table format that brings performance and guarantees that data lakes didn’t have before.

What is a Table Format?

A table format is a way to organize a dataset’s files so they look and work like a single “table”. From the user’s point of view, it answers the question: “what data is in this table?”

This clear definition makes it possible for different people, teams, and tools to read from or write to the same table at the same time. The main goal of a table format is to provide an abstraction of the table to users and tools, making it easier for them to interact with the underlying data efficiently.

Table formats aren’t new. They’ve been around since early relational databases like System R, Multics, and Oracle. These systems were based on Edgar Codd’s relational model. Back then, people didn’t use the term table format, but the idea was the same: users worked with “tables,” and the database engine took care of organizing the actual files, managing things like transactions and storage layout. In those databases, only the database engine could interact with the data files. If something else tried to access them, it could break the whole system. Users didn’t need to know where the data was stored or how—it was all handled for them.

But in today’s big data world, this idea doesn’t work anymore. Now we use many different compute engines depending on what we’re doing—some are better for BI, others for ML, and so on. That means we can’t rely on just one engine to manage and access all the data. In a data lake, data is stored as files in cloud storage like Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). A single table could be made up of thousands or even millions of files. When using SQL or writing code in languages like Java, Scala, Python, or Rust, we don’t want to manually list all the files that belong to a table. It would take too much time and would likely cause inconsistencies.

So the solution was to create a standard way to define what data belongs to the table in a data lake. This is what a table format does, as shown in the following figure:

Hive: The Original Table Format

In the early days of running analytics on Hadoop data lakes, people used the MapReduce framework. But writing MapReduce jobs was hard—it required complex Java code, which many analysts didn’t know how to write. To solve this, Facebook created Hive in 2009. Hive made it easier to do analytics on Hadoop by letting users write SQL instead of Java code. The Hive framework would take your SQL statements and turn them into MapReduce jobs that could run on Hadoop. But for this to work, Hive needed a way to know which files in storage belonged to which table. That’s why the Hive table format and the Hive Metastore were created.

The Hive table format worked by treating all files in a folder (or prefix, for object storage) as a single table. Subfolders were used as partitions of that table. The Hive Metastore kept track of the folder paths, so query engines could know where to find the right data. This is illustrated in this figure:

Some benefits of the Hive table format:

It allowed more efficient queries using partitioning (splitting data by a key) and bucketing (grouping data using a hash function), which helped avoid full table scans.
It was file format agnostic, so it allowed the data community over time to develop better file formats, such as Apache Parquet.
You could do atomic swaps of whole partitions in the Hive Metastore—this meant changes to a partition would be all-or-nothing (atomic).
It became the standard used by many tools, giving a common answer to the question: “what data is in this table?”

But over time, people found many limitations:

You couldn’t make atomic changes at the file level, only at the partition level.
It wasn’t possible to update multiple partitions in one atomic operation. This could cause inconsistent data during updates.
It didn’t support concurrent updates well, especially when using tools other than Hive.
Query engines had to list all files and directories, which was slow and could affect query speed.
Partition columns were often derived from other columns (e.g., getting the month from a timestamp). If users didn’t filter by the partition column, they might accidentally do a full table scan.
Table statistics were created by separate jobs, and often those stats were out of date or missing, making query optimization harder.
Cloud object storage can slow down when too many requests are made to the same prefix (like a folder). This caused issues when too many files were in a single partition.

As datasets and use cases grew larger, these problems got worse. This led to the creation of newer table formats to solve these issues.

Modern Data Lake Table Formats

To fix the problems of the Hive table format, a new generation of table formats was created. These new formats took a different approach to solve the same issues.

The creators of these modern formats realized the main problem with the Hive format was that it defined a table based on directories, not on the individual data files. Modern table formats like Apache Iceberg, Apache Hudi, and Delta Lake instead define tables as a list of files. They use metadata to tell the engine which files are part of the table, instead of relying on folders. This more detailed way of defining a table made it possible to support new features like ACID transactions, time travel, and more.

All modern table formats focus on giving some key improvements over the Hive table format:

They support ACID transactions—safe transactions that either complete fully or not at all. Hive couldn’t guarantee this in many cases.
They allow multiple writers to safely write to the same table. If two or more users write data at the same time, there’s a system to make sure the second one knows what the first did, so the data stays correct.
They provide better metadata and table statistics, which help query engines scan fewer files and run more efficiently.

What Is Apache Iceberg?

Apache Iceberg is a table format created in 2017 by Ryan Blue and Daniel Weeks at Netflix. It was built to fix issues like performance and consistency, which were common with the Hive table format. In 2018, the project became open source and was donated to the Apache Software Foundation. Many companies started contributing to it, including Abxzpple, Dremio, AWS, Tencent, LinkedIn, and Stripe.

How Apache Iceberg Came to Be

Netflix found that the foundational big problem with the Hive format was that tables were tracked using folders and subfolders. This made it hard to have features like consistency, concurrent updates, and other functions that data warehouses usually support.

Because of this, Netflix started to build a new format with a few main goals:

Consistency. If you update a table in more than one partition, users should not see data in a broken or incomplete state. They should either see the old version or the new version of the data—not something in between.
Performance. In Hive, the system had to list all files and folders, which made query planning slow. Iceberg wanted to avoid this and use metadata instead, so the system can plan queries faster and only scan the files it really needs.
Easy to use. Users shouldn’t need to know the physical layout of the data to get the benefits of partitioning. For example, if someone filters by a timestamp, they shouldn’t have to also filter by a month column that was derived from that timestamp.
Evolvability. Changing the schema or partitioning in Hive was risky and often meant rewriting the whole table. Iceberg wanted to let users make changes safely and easily, without full rewrites.
Scalability. All of the above should work even with huge datasets (Netflix deals with petabytes of data).

To achieve this, Iceberg tracks a table as a list of files, not as a list of folders. Apache Iceberg is a specification, which means it’s a standard way to write and manage table metadata in a data lakehouse.

To help people and engines work with Iceberg, the project includes subaries, and implementations exist for popular engines like Apache Spark and Apache Flink.

Iceberg was designed to work well with existing tools and storage solutions, so it can be added into the current ecosystem without needing to replace everything. The idea is to make Apache Iceberg a standard that all engines understand, so users don’t even have to think about it. This is already happening—many tools now work with Iceberg tables easily, and in the future, even data engineers might not have to deal with the table format directly. They’ll just work with data lakes like they do with data warehouses.

The Apache Iceberg Architecture

Apache Iceberg keeps track of things like a table’s schema, partitioning, sorting, and even changes over time using a tree of metadata. This makes it much faster for engines to plan queries compared to old-school data lake methods. The following figure depicts this tree of metadata:

This metadata tree has four main parts:

Manifest File. This is a list of data files. Each entry includes the file’s path and important details (metadata) about that file. This helps the system make faster and smarter decisions about which files to read.
Manifest List. This is a file that represents a single version (or snapshot) of the table. It contains a list of manifest files, plus some summary stats. These help with planning efficient queries on that specific snapshot.
Metadata File. This file defines the overall structure of the table. It includes the schema, partition layout, and a list of all the snapshots of the table.
Catalog. The catalog helps track where the table lives. It maps the table name to the location of its latest metadata file. This is kind of like the Hive Metastore, but more modern. Different tools can be used as the catalog—Hive Metastore itself, among others.

Key Features of Apache Iceberg

Apache Iceberg introduces a set of modern features that go far beyond just fixing problems with older systems like Hive. Its architecture supports powerful capabilities that are especially useful for managing large-scale data lakes and data lakehouse setups.

ACID Transactions

One of the core features of Apache Iceberg is support for ACID transactions—which stands for Atomicity, Consistency, Isolation, and Durability. These are properties that ensure your data remains correct and reliable, even when multiple processes are reading and writing to the data at the same time.

Iceberg uses something called optimistic concurrency control to manage these transactions:

This model works under the assumption that most transactions won’t conflict with each other.
Instead of using locks (like some databases do), Iceberg checks for conflicts only when needed—at commit time.
If there’s a conflict, the transaction fails and doesn’t affect the data. This keeps performance high while still protecting data integrity.

In contrast, a pessimistic concurrency model (which uses locks to prevent problems ahead of time) wasn’t available in Iceberg at the time of writing, but might be added in the future.

Importantly, the system relies on a catalog (such as a metadata store) to handle these transactions. The catalog is responsible for tracking the state of the table and making sure updates don’t conflict. This ensures:

Transactions are atomic—they either fully succeed or don’t happen at all.
Different tools or systems using the table won’t cause inconsistent data or accidental overwrites.

Without this catalog-based control, there’s a risk that two different systems could update the same table in conflicting ways, which could result in data loss.

Partition Evolution

Before Iceberg, one major problem with data lakes was how difficult it was to change how a table is partitioned (which is a way of physically organizing data for faster querying). If you wanted to change the partitioning strategy:

You would often have to rewrite the entire table, which can be extremely expensive and time-consuming—especially for large datasets.
As a result, people often just accepted the limitations of the original partitioning, sacrificing performance.

Iceberg solves this problem by allowing partition evolution:

You can change the partitioning of a table at any time, without having to rewrite the existing data.
This is possible because Iceberg separates the metadata (information about how the data is organized) from the actual data files.
So, when you change the partitioning scheme, Iceberg only updates metadata—not the data itself—which makes it fast and cost-effective.

An example provided in the text shows a table that originally used monthly partitioning and was later updated to use daily partitioning. In this case:

The older data remains organized by month.
New data is added using daily partitions.
During queries, the engine understands the mixed partitioning and creates a plan that respects the specific partitioning scheme of each part of the data.

This flexible system gives you the performance benefits of better partitioning strategies without the heavy cost of rewriting large datasets.

Hidden Partitioning

In traditional data lake systems, especially Hive, users often had to understand how a table was physically partitioned in order to write efficient queries. If you didn’t, your query might scan the entire dataset—even if it could’ve been optimized.

Problem in older systems:

Suppose your data is partitioned using three separate fields like event_year, event_month, and event_day.
A typical user might just want to filter with a condition like:
event_timestamp >= DATE_SUB(CURRENT_DATE, INTERVAL 90 DAY)
Although this seems logical, Hive wouldn’t understand that the timestamp filter maps to those partition fields. As a result, it performs a full table scan, ignoring the partitions.

How Iceberg solves this:

Iceberg supports partition transforms, which let you partition by a column and apply a function to it—like year, month, day, hour, truncate, or bucket.
This means you can partition by a timestamp field directly, using a transform like day(event_timestamp).
Now, even if your query filters directly on event_timestamp, Iceberg knows how to optimize it using its metadata, avoiding full table scans.
Users no longer need to know or manually filter by helper columns like event_year or event_day. This is what’s meant by “hidden partitioning”—the complexity of physical partitioning is hidden from the user, making queries both easier and faster.

Row-Level Table Operations

Apache Iceberg supports two strategies for handling row-level changes, such as updates and deletes:

Copy-on-Write (COW):
- If a single row in a file changes, the entire file is rewritten with that change included.
- This is simpler but can be costly for heavy update workloads.
Merge-on-Read (MOR):
- Instead of rewriting the entire file, Iceberg creates a new smaller file that includes only the changes.
- When reading the data, Iceberg combines the base file with the changes.
- This method is more efficient for frequent updates and deletes, and gives users the flexibility to pick a strategy based on workload needs.

Time Travel

Another powerful feature of Iceberg is time travel, which allows you to access the state of a table at any point in the past.

Iceberg creates immutable snapshots of the table every time a change is committed.
These snapshots can be queried directly, letting you:
- Reproduce past reports (e.g., quarter-end financial summaries).
- Re-run machine learning models on historical data.
- Investigate how a table looked before certain changes were made.
You don’t need to copy the data to another location to do this—just query the snapshot.

Version Rollback

In addition to viewing past snapshots (time travel), Iceberg also lets you revert a table back to any of those previous versions.

This is especially useful if a mistake is made—like a bad data load or schema change.
Rolling back is fast and safe since snapshots already contain all necessary metadata and file references.
The table’s current state is simply pointed back to the older snapshot, without moving or copying files.

Schema Evolution

Data doesn’t stay the same forever, and neither should your table schema. Iceberg supports flexible schema evolution, allowing you to make safe and controlled changes over time:

You can add or remove columns, rename them, or even change data types.
- For example, converting a column from int to long if the values start to grow larger.
These changes don’t break the table or existing queries because Iceberg tracks all schema versions in its metadata.
This makes it easier to maintain tables in real-world situations where requirements and data formats often evolve.

Quartz 4

Explorer

1. Introduction to Apache Iceberg

How Did We Get Here? A Brief History

Foundational Components of a System Designed for OLAP Workloads

Bringing It All Together

The Data Warehouse

A brief History

Pros and Cons of a Data Warehouse

Data Lake

A brief History

Pros and Cons of a Data Lake

Should I Run Analytics on a Data Lake or a Data Warehouse?

The Data Lakehouse

What is a Table Format?

Hive: The Original Table Format

Modern Data Lake Table Formats

What Is Apache Iceberg?

How Apache Iceberg Came to Be

The Apache Iceberg Architecture

Key Features of Apache Iceberg

ACID Transactions

Partition Evolution

Hidden Partitioning

Row-Level Table Operations

Time Travel

Version Rollback

Schema Evolution

Graph View

Table of Contents