Denny Lee

How Delta Lake Liquid Clustering conceptually works

We previously discussed the roots of Hive-style partitioning and how clustering data can improve performance. From a file system perspective, Hive-style partitioning is not designed for cloud object stores. Instead, we should focus on multi-dimensional data locality to improve query performance. Delta Lake Liquid Clustering builds upon Z-Order and Hilbert curves to provide more flexibility and better performance.

Parallelism and Uneven Data Distribution

Let’s briefly return to our partitioning and cloud object store example. One of the inherent assumptions made here is that all of the data is evenly partitioned. Each blue line represents a task (e.g., an Apache Spark™ job, parallel Rust threads, etc.). If the yearly data partitions are of similar size, each of those tasks completes in a similar duration.

Data partitions of similar size help query parallelism
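To make the setup concrete, here is a minimal PySpark sketch of Hive-style partitioning by year. The sales_df DataFrame, the schema, and the storage paths are hypothetical, and it assumes a Spark session configured with the delta-spark package:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-style-partitioning").getOrCreate()

    # Hypothetical raw sales data that includes a `year` column
    sales_df = spark.read.format("parquet").load("/data/raw/sales")

    # Hive-style partitioning: one directory per year under the table path
    (sales_df.write
        .format("delta")
        .partitionBy("year")
        .mode("overwrite")
        .save("/data/delta/sales_by_year"))

Each query that filters on year then maps to one directory worth of files, which is exactly the even-distribution assumption illustrated above.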

But in real-world workloads, it is more common to see uneven or even skewed partitions. Tasks over larger partitions take longer, while tasks over smaller partitions complete more quickly. This is problematic for parallel workloads because the job finishes only as fast as its slowest task. That is, while the 2004, 2019, 2020, and 2021 tasks will have already completed, the job itself will not finish until the 2022 and 2023 tasks are finished.

Data partitions of different sizes skew query parallelism

Note that Adaptive Query Execution (AQE) was introduced in Apache Spark 3.0. In addition to faster performance, it can also handle skewed datasets. For more information, please refer to What’s new in Apache Spark 3.0: Xiao Li and Denny Lee and Tech Talk: Top Tuning Tips for Spark 3.0 and Delta Lake.
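As a minimal sketch (reusing the spark session from the sketch above), these Spark SQL configurations enable AQE and its skewed-join handling; in recent Spark releases AQE is already enabled by default:

    # Enable Adaptive Query Execution and its skewed-join optimization
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")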

Liquid Clustering Optimizes Data Sizes

Instead of relying entirely on the query engine to optimize around skew, Delta Lake Liquid Clustering addresses it at the file level. Intuitively, we should combine the smaller years together to avoid the small file problem.

Combine small years together to avoid the small file problem
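Before Delta Lake can manage file sizes this way, the table needs to be defined with Liquid Clustering: the clustering columns are declared with CLUSTER BY instead of PARTITIONED BY. Here is a minimal sketch, assuming a Delta Lake version that supports Liquid Clustering (3.1 or later) and a hypothetical sales_clustered table:

    # Declare a liquid-clustered Delta table (CLUSTER BY instead of PARTITIONED BY)
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_clustered (
            year     INT,
            customer STRING,
            amount   DOUBLE
        )
        USING DELTA
        CLUSTER BY (year)
    """)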

At the same time, we should divide the larger years to prevent long-running tasks or queries.

Divide larger years to prevent long-running tasks or queries
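Both the combining and the dividing happen when the clustered table is optimized; a minimal sketch against the hypothetical sales_clustered table from the previous snippet:

    # Trigger liquid clustering: small files are compacted together and
    # oversized files are rewritten toward the optimal target file size
    spark.sql("OPTIMIZE sales_clustered")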

Applying Liquid Clustering to two axes

The previous simplified example focused on a single column (year). But as noted previously, we can apply clustering to multiple columns. To simplify the diagram, let’s focus on two columns (or dimensions): year and customer. The diamonds represent data files with t-shirt sizes of S (red), M (yellow), L (green, the optimal size), and XL (blue).

Non-clustered data layout for customer and year
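To cluster on both dimensions, the clustering columns can include both year and customer. A minimal sketch against the hypothetical sales_clustered table; the clustering columns of an existing clustered table can be changed with ALTER TABLE:

    # Cluster the table on two dimensions (columns)
    spark.sql("ALTER TABLE sales_clustered CLUSTER BY (year, customer)")

    # Subsequent OPTIMIZE runs cluster the data by (year, customer)
    spark.sql("OPTIMIZE sales_clustered")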

Combining across two axes

With Liquid Clustering, our S (red) and M (yellow) data files can be automatically combined across both dimensions to create optimally sized L files (green).

Combining smaller files across two dimensions

The result is fewer data files, which ultimately requires fewer resources to query the data.

The result of combining smaller files across two dimensions
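One way to observe this is the table’s file count. A minimal sketch using DESCRIBE DETAIL on the hypothetical sales_clustered table; numFiles should drop after the smaller files are combined:

    # Inspect the number and total size of the table's underlying files
    spark.sql("DESCRIBE DETAIL sales_clustered").select("numFiles", "sizeInBytes").show()

    spark.sql("OPTIMIZE sales_clustered")

    spark.sql("DESCRIBE DETAIL sales_clustered").select("numFiles", "sizeInBytes").show()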

Dividing across two axes

For larger datasets, such as those of this enterprise customer, Liquid Clustering will divide the XL files into L (optimal) files.

Dividing larger files across two dimensions

The result is a set of smaller, optimally sized files (green), avoiding the long-running queries caused by oversized files.

The result of dividing larger files across two dimensions

Discussion

Here we have shown how Delta Lake Liquid Clustering can create optimally sized underlying files for a Delta table. This technique is used in combination with data skipping and Z-Order to optimize your query performance across multiple dimensions (columns).
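For example, a query that filters on the clustering columns can skip data files whose min/max column statistics do not match the predicate. A minimal sketch against the hypothetical sales_clustered table:

    # Filters on the clustering columns (year, customer) allow data skipping:
    # files whose column statistics cannot match the predicate are never read
    spark.sql("""
        SELECT customer, SUM(amount) AS total_amount
        FROM sales_clustered
        WHERE year = 2023 AND customer = 'ACME'
        GROUP BY customer
    """).show()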

For more information on Liquid Clustering, please watch this Delta Lake Deep Dive webinar on Liquid Clustering featuring Vítor Teixeira, Sr. Data Engineer at Veeva Systems. He also published the blog post: Delta Lake — Partitioning, Z-Order and Liquid Clustering.
