Denny Lee

Optimize by Clustering not Partitioning Data with Delta Lake

The raison d’être for Hive-style partitioning is to optimize physical disk and data layout, as discussed in Organizing Data: Partition by Disk to Hive-style Partitioning. Instead of organizing our data through partitioning, we should rethink how we organize our data when working with cloud object stores. In this post, we discuss how data skipping and Z-Order clustering reduce the amount of data that needs to be scanned, which translates into significant query performance improvements and cost reductions.

Improve Performance with Data Skipping

Traditionally, when querying a data lake, the query engine’s first action is to list the files prior to querying those files. On cloud object stores like S3, this listFrom action is very resource intensive. Lakehouse storage formats like Delta Lake take the best of both worlds of data lakes and data warehouses: they provide the ACID transaction reliability of data warehouses while providing the scalability and flexibility of data lakes. Delta Lake achieves this through its transaction log, as shown in the following diagram of a Rust client querying a Delta table.

Rust client querying Delta Lake table via the Delta transaction log

Delta Lake provides the query engine (e.g., Rust via delta-rs) the exact paths to the files that make up the table. Therefore, with Delta Lake you significantly reduce the performance issues associated with file listing (i.e., listFrom).

Rust client receiving less data (i.e., less files) via data skipping through the Delta transaction log

Delta improves query performance via data skipping by using column statistics (min/max values) stored in the Delta log. For example, for a query filtering on the year 2023, the query engine will filter out any non-2023 data files (*.parquet) by using the min/max values in the transaction log. By skipping data, the query engine reads less data and thus delivers faster performance.
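To make this concrete, here is a minimal sketch of min/max file pruning. The file names and statistics mirror the simplified example in this post, not a real Delta transaction log, and `files_to_scan` is a hypothetical helper, not a delta-rs API.

```python
# Hypothetical per-file min/max statistics, as a query engine would
# read them from the Delta transaction log (simplified for this post).
files = [
    {"path": "1.parquet", "min": 2023, "max": 2023},
    {"path": "2.parquet", "min": 1999, "max": 2020},
    {"path": "3.parquet", "min": 2009, "max": 2018},
]

def files_to_scan(files, year):
    """Keep only files whose [min, max] range could contain the year."""
    return [f["path"] for f in files if f["min"] <= year <= f["max"]]

# A filter on year = 2023 skips 2.parquet and 3.parquet entirely.
print(files_to_scan(files, 2023))  # ['1.parquet']
```

Only files whose statistics overlap the filter are ever opened; every other file is skipped without being read.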

Multi-dimensional Points to One-dimensional Values

Clustering data so that the min/max ranges are narrow and non-overlapping makes data skipping even more effective. For example, the preceding image reflects the following data file structure.

# Previous file clustering configuration
1.parquet {min: 2023, max: 2023}
2.parquet {min: 1999, max: 2020}
3.parquet {min: 2009, max: 2018}

In this example, querying for data in 2010 would require reading both 2.parquet and 3.parquet. Instead, it would make more sense to configure the files per the following.

# Optimal year file clustering configuration
1.parquet {min: 2019, max: 2023}
2.parquet {min: 2014, max: 2018}
3.parquet {min: 2009, max: 2013}

This simplified example works great for a single column, but it becomes much more complex when we need to cluster by multiple columns. This is a common problem for relational databases, where we’re trying to assign multiple columnar values that we typically query/filter/group by to the same file. Mathematically, we want to preserve locality by mapping multi-dimensional points to one-dimensional values. Therefore, we want to utilize locality-preserving space-filling curves, the most commonly used ones being the Z-order and Hilbert curves.
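The standard way to compute a Z-order value is bit interleaving (a Morton code). The following is a minimal sketch for two dimensions; real engines generalize this to more columns and first map column values to fixed-width integers.

```python
def interleave_bits(x: int, y: int, bits: int = 4) -> int:
    """Map a 2-D point to its 1-D Z-order (Morton) value by
    interleaving the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x contributes the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)  # y contributes the odd bits
    return z

# Sorting 2-D points by their Z-order value keeps spatial neighbors
# close together in the one-dimensional order files are written in.
points = [(x, y) for y in range(4) for x in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(*p))
print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Note how the first four points in Z-order form the top-left 2×2 quadrant rather than a 4×1 strip; that is precisely the locality the curve preserves.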

A great reference on this topic is Adrian Ionescu’s post Processing Petabytes of Data in Seconds with Databricks Delta.

Z-Order Clustering Primer

Z-Order is a locality-preserving space-filling curve that allows you to skip data when filtering over multiple columns. Conceptually, if you were to treat each 4×1 rectangle as a file, querying the data would serially process each file through linear scans.

Instead of querying data linearly, organize the data in quadrants and visit each quadrant in z-order

But if we organize data via Z-Order, conceptually each file is a 2×2 quadrant. File scanning proceeds in “Z” order (0, 1, 2, 3) in a recursive fashion. Let’s say you run a query that filters on data represented by two dimensions, or axes (x, y).

Looking at data in two dimensions, the optimal path would be to only read at x=2, y = 2.

The green circles (on a yellow background) represent the data we want to read, while the red circles represent the data we want to skip. Recall that for Z-Order, a 2×2 square (dotted line) represents each file.

Linear Scanning

When querying the data linearly, however, a 4×1 rectangle (dotted line) represents each file. Scanning linearly results in reading nine (9) files (yellow background, dotted-line rectangles). There are also 21 false positives, shown as the red circles on a yellow background.

Scanning the data linearly results in 9 files read and 21 false positives

Z-Order Scanning

On the other hand, we can reduce the number of files and false positives by using Z-Order clustering. Each file is represented by a 2×2 square (dotted lines) and is scanned in Z-order (light blue). But the connection between each Z (i.e., each file) changes depending on the data itself.

Using Z-Order clustering, only 7 files are read with 13 false positives

In this example, only 7 (vs. 9) files are scanned by the query engine. In addition, there are only 13 (vs. 21) false positives. 
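The same effect shows up in a toy simulation. The following sketch assumes an 8×8 grid, 4×1 horizontal strips for the linear layout, 2×2 quadrants for the Z-order layout, and a filter of x in {2, 3} and y in {2, 3}; these are illustrative assumptions, so the counts won’t match the figures above, but the direction of the result is the same.

```python
# Assumed file layouts on an 8x8 grid (illustrative, not the exact
# layout in the figures above).
def linear_file(x, y):
    return (y, x // 4)       # linear clustering: 4x1 horizontal strips

def zorder_file(x, y):
    return (x // 2, y // 2)  # Z-order clustering: 2x2 quadrants

# Assumed predicate: x in {2, 3} and y in {2, 3}.
matches = [(x, y) for x in range(8) for y in range(8)
           if x in (2, 3) and y in (2, 3)]

linear_files = {linear_file(x, y) for x, y in matches}
zorder_files = {zorder_file(x, y) for x, y in matches}

# Files touched: two linear strips vs. a single Z-order quadrant.
print(len(linear_files), len(zorder_files))  # 2 1
```

Because the Z-order layout co-locates rows that are close in both x and y, the matching data lands in fewer files, so the engine scans less and sees fewer false positives.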

Discussion

The preceding is a simplified two-dimension (column) example, but this type of clustering can be applied to multiple columns. A general rule of thumb is that this technique is effective with between five (5) and eight (8) columns. Instead of being limited to one or a few columns with partitioning, there is significantly more flexibility when you cluster the data using Z-Order.

The next post How Delta Lake Liquid Clustering conceptually works describes how Liquid clustering provides even more flexibility and performance.
