We previously discussed the roots of hive-style partitioning and improving performance by clustering data. From a file system perspective, hive-style partitioning is not designed for cloud object stores. Instead, we should focus on multi-dimensional data locality to improve query performance. Delta Lake Liquid Clustering builds upon Z-Order and Hilbert curves to provide more flexibility and better performance.
Parallelism and Uneven Data Distribution
Let’s briefly return to our partitioning and cloud object store example. One of the inherent assumptions made here is that all of the data is evenly partitioned. Each blue line represents a task (e.g., an Apache Spark™ job, parallel Rust threads, etc.). If the yearly data partitions are of similar size, each of those tasks completes in a similar duration.

But for real-world workloads, it is more common to see uneven or even skewed partitions. Tasks processing larger partitions take longer, while tasks processing smaller partitions complete more quickly. This is problematic for running parallel workloads because a job can only finish as fast as its slowest task. That is, while the 2004, 2019, 2020, and 2021 tasks will have already completed, the job itself will not finish until the 2022 and 2023 tasks are finished.
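To make the skew concrete, a quick per-partition row count can show how unevenly the work is distributed across tasks. This is a minimal sketch; the table path and `year` column are hypothetical:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table partitioned by a `year` column.
df = spark.read.format("delta").load("/data/sales")

# Row counts per year approximate how unevenly work is spread across
# tasks: the largest partition dictates the job's overall duration.
df.groupBy("year").count().orderBy(F.desc("count")).show()
```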

Note: Adaptive Query Execution (AQE) was introduced in Apache Spark 3.0. In addition to delivering faster performance, it can also handle skewed datasets. For more information, please refer to What’s new in Apache Spark 3.0: Xiao Li and Denny Lee and the Tech Talk: Top Tuning Tips for Spark 3.0 and Delta Lake.
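As a sketch, AQE and its skew handling can be enabled with the following Spark 3.x configuration flags (AQE is on by default in newer Spark releases):

```python
# Enable Adaptive Query Execution.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Split oversized shuffle partitions produced by skewed joins.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Coalesce many small shuffle partitions into fewer, better-sized ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```
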
Liquid Clustering Optimizes Data Sizes
Instead of relying entirely on the query engine to optimize for skew, Delta Lake Liquid Clustering addresses this at the file level. Intuitively, we should combine the smaller years together to avoid the small file problem.

At the same time, we should divide the larger years to prevent long-running tasks or queries.
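As a sketch of how this looks in practice (assuming a Spark session with a recent Delta Lake release that supports Liquid Clustering, and a hypothetical `sales` table), clustering is declared when the table is created and applied by OPTIMIZE:

```python
# Define a Delta table clustered by `year` (names are hypothetical).
spark.sql("""
  CREATE TABLE sales (
    year INT,
    customer STRING,
    amount DOUBLE
  ) USING DELTA
  CLUSTER BY (year)
""")

# OPTIMIZE triggers clustering: small files are combined and large
# ones are rewritten toward an optimal target size.
spark.sql("OPTIMIZE sales")
```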

Applying Liquid Clustering to two axes
The previous simplified example focuses on a single column (year). But as noted previously, we can apply clustering to multiple columns. To simplify the diagram, let’s focus on two columns (or dimensions): year and customer. The diamonds represent t-shirt data sizes: S (red), M (yellow), L (green, optimal), and XL (blue).
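Extending the earlier sketch, the clustering keys of the hypothetical `sales` table can be changed to cover both dimensions; `ALTER TABLE ... CLUSTER BY` updates the keys without requiring a full rewrite of the table:

```python
# Re-cluster the hypothetical `sales` table on both dimensions
# from the diagram: year and customer.
spark.sql("ALTER TABLE sales CLUSTER BY (year, customer)")

# Subsequent OPTIMIZE runs cluster the data on both keys.
spark.sql("OPTIMIZE sales")
```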

Combining across two axes
With Liquid Clustering, our S (red) and M (yellow) data sizes can be automatically combined across both dimensions to create optimally sized L files (green).

The result is fewer data files, which ultimately require fewer resources to query the data.

Dividing across two axes
For larger datasets, such as this enterprise customer’s, Liquid Clustering will divide the XL files into L (optimal) files.

The result is smaller sets of files (green), avoiding long-running queries caused by oversized files.
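To verify the effect on the file layout, `DESCRIBE DETAIL` reports the file count and total size of the hypothetical `sales` table; comparing the output before and after OPTIMIZE shows the rewrite:

```python
# numFiles and sizeInBytes summarize the table's current file layout.
spark.sql("DESCRIBE DETAIL sales") \
     .select("numFiles", "sizeInBytes") \
     .show()
```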

Discussion
Here we have shown how Delta Lake Liquid Clustering can create optimally sized underlying files for a Delta table. This technique is used in combination with data skipping and Z-Order to optimize your query performance across multiple dimensions (columns).
For more information on Liquid Clustering, please watch this Delta Lake Deep Dive webinar on Liquid Clustering featuring Vítor Teixeira, Sr. Data Engineer at Veeva Systems. He also published the blog post: Delta Lake — Partitioning, Z-Order and Liquid Clustering.
