Denny Lee

What is the Delta Lake Transaction Log?

Delta Lake is an open-source storage framework that enables you to build a Lakehouse architecture quickly. Pair Delta Lake with your favorite compute engine, such as Spark, PrestoDB, Flink, Trino, and Hive. Ensure the reliability of your tables with the API of your choice, including Scala, Java, Rust, or Python. The Delta Lake transaction log (also known as the Delta Log) is an ordered record of every transaction that has been performed on a Delta Lake table since its inception.

Single source of truth

Delta Lake allows multiple readers and writers of a given table to all work on the table at the same time. To show users correct views of the data at all times, the Delta log is a single source of truth. It is the central repository that tracks all user changes to the table. This concept is important because, over time, processing jobs will fail in your data lake.  The result is partial files that are not removed.  Subsequent processing or queries will not be able to ascertain which files should or should not be included in their queries.

Partial file example on a data lake

The following image is a common data processing scenario where the table is represented by two Parquet files (i.e., 1.parquet and 2.parquet) at t0.

At t1, job 1 extracts file 3 and file 4 and writes them to disk.  However, due to some error (e.g., network hiccup, storage temporarily offline, etc.), an incomplete part of file 3 and none of file 4 are written into 3.parquet (red).  Thus, 3.parquet is a partial file. Incomplete data will be returned to any clients that subsequently query this table.

Data lake partial file example
Data lake partial file example

To complicate matters, at t2, a new version of the same processing job (job 1 v2) successfully completes its task. It generates a new version of 3.parquet and 4.parquet (blue). But because the partial 3'.parquet (red) exists alongside with 3.parquet there will be double counting.

How does the Delta log solve this problem?

By using a transaction log to track which files are valid, we can avoid the preceding scenario.  Thus, when a client reads a Delta Lake table, the engine (or API) initially verifies the transaction log to see what new transactions have been posted to the table. It then updates the client table with any new changes. This ensures that any client’s version of a table is always synchronized. Clients cannot make divergent, conflicting changes to a table. Let’s repeat the same partial file example on a Delta Lake table.

Partial file example on Delta Lake

The following image is the same preceding scenario where the table is represented by two Parquet files (i.e., 1.parquet and 2.parquet) at t0. The transaction log records that these two files make up the Delta table at t0 (Version 0).

Delta Lake partial file example
Delta Lake partial file example

At t1, job 1 fails with the creation of 3.parquet (red).  However, because the job failed, the transaction was not committed to the transaction log. No new files are recorded; notice how the transaction log only has 1.parquet and 2.parquet listed. Any queries against the Delta table at t1 will only read these two files, even if other files are in the folder.

At t2, job 1 v2 completed, and the output is the files 3.parquet and 4.parquet. Because the job was successful, the Delta Log includes entries for only the two successful files. That is, 3'.parquet (red) is not included in the log. Therefore, any clients querying the Delta table at t2 will only see the correct files.

Atomicity in Delta Lake

Atomicity is one of the four properties of ACID transactions that guarantees that operations performed on your table either complete fully or don’t complete at all. Without this property, it’s far too easy for a hardware failure or a software bug to cause data to be only partially written to a table, resulting in messy or corrupted data.

The transaction log is how Delta Lake offers the guarantee of atomicity. Succinctly, if it’s not recorded in the transaction log, it never happened. It only records transactions that execute fully and completely. And all clients querying a Delta table will use this log as the single source of truth. The Delta log allows users to reason about their data and have peace of mind about its fundamental trustworthiness at petabyte scales.

Addendum

To understand the transaction log specification, refer to the Delta Lake Protocol specification. By following this specification, any API can read from and write to a Delta Lake table.

Diving into Delta Lake: Unpacking the Transaction Log with Burak Yavuz and Denny Lee

If you want to dive deeper, enjoy Unpacking the Transaction Log by Burak Yavuz and me.

3 responses to “What is the Delta Lake Transaction Log?”

  1. […] the past few months, we have explored what is the Delta Lake transaction log, understanding the transaction log at the file level, and we also took a peek into the transaction […]

  2. […] we had discussed how the Delta Lake transaction log works at a high level, now let’s talk about concurrency! In the previous posts, our examples […]

  3. […] while providing the scalability and flexibility of data lakes. Delta Lake achieves through its transaction log as noted in the following diagram below of rust-lang client querying a Delta […]

Leave a Reply

Discover more from Denny Lee

Subscribe now to keep reading and get access to the full archive.

Continue reading