Why Structured Streaming and Delta Lake for Batch ETL?

August 25, 2023

A common approach to developing data pipelines is to use Structured Streaming (aka Spark Structured Streaming) with Delta Lake. This allows you to perform near-real data processing using the same APIs for batch processing. All of this is on top of Delta Lake to ensure the transactional consistency of your data. With Project Lightspeed, Structured Streaming latency can achieve latencies lower than 250 ms, addressing use cases such as real-time monitoring and alerting.

So why would you use a system designed for near-real-time (and rapidly more real-time) scenarios for batch processing? After all, common batch ETL processes have latencies in the minutes to hours typical of data warehousing scenarios.

Simpler and faster ETL with less compute

Before we dive into the technical details, let’s start with a cool use case. The Spark+AI Summit 2019 presentation, Building Sessionization Pipeline at Scale with Databricks Delta, had some interesting details.

Structured Streaming and Delta Lake reduced their jobs from 84 to 3 with 10x less compute

Structured streaming provides exactly-once semantics (more later), ensuring the reliability of the data.
- By using Trigger.Once (now deprecated, use Trigger.AvailableNow instead), they could reduce their compute instances by an order of magnitude.
- This allowed the team to reduce from 84 batch to 3 streaming jobs, halving their data latency from ~14 hours to ~7 hours.
Delta Lake provides transactional protection, allowing structured streaming to perform DML operations. Therefore, their data pipelines could perform operations such as deletes, updates, and merges.
The result was scalable data pipelines that reliably provided insight.

Their scenario did not require real-time or near-real-time processing. Yet, with structured streaming and Delta Lake, they could reduce compute and simplify their data pipelines, all the while improving data reliability.

Diving deeper into how this works

In the next section, we will discuss the internals of how this is possible. Initially, we will explore the concept of exactly-once guarantees (or semantics).

What are Exactly-Once guarantees?

Past systems provided different guarantees, such as at-least-once which ensured no missing data but potentially having duplicates. But the desired state is to ensure no missing data and no duplicates, i.e., exactly-once semantics.

For a system to provide exactly-once semantics:

It will track the files that are currently processing (black box), failing (red x), or will need reprocessing (red box).
It will track which files successfully completed (yellow box)
It will track which files are new and need processing (blue box)

To perform this task, the processing engine will need to store a lot of metadata to keep track of all of these files and their state.

What does a stream processing engine do?

Therefore, to have a stream processing engine provide exactly-once semantics, it must track all of these files and their current state.

Apache Spark Structured Streaming provides exactly-once semantics

Learn more about the details via the Structured Streaming Programming Guide > Programming Model. The key tenet is that structured streaming tracks files and stores metadata that is fault-tolerant and ensures these files are processed exactly-once.

What is the difference between streaming and batch processing?

Within this context of structured streaming, what is the difference between stream and batch processing?

With Apache Spark Structured Streaming, the difference between streaming and batch is latency

Simply put, the key difference is latency.

As noted earlier, irrelevant of latency, to provide exactly-once semantics the system will track the files and stores the state in a fault-tolerant manner. With Apache Spark, you get this out of the box for both streaming and batch without the need to reason about exactly-once semantics.

The simplest way to perform streaming analytics is not having to reason about streaming at all
Apache Spark 2.0: A Deep Dive Into Structured Streaming – by Tathagata Das

For more information, watch Tathagata Das present Apache Spark 2.0: A Deep Dive Into Structured Streaming at Spark Summit 2016.

Learn more about Structured Streaming in this deep dive

Decoupling Business Logic from Latency

Both streaming and batch processing have the same set of metadata tracking problems; they are just performed at different latencies. For streaming scenarios, your clusters must stay online to continuously process incoming data (top). But for batch processing, you can shut down your clusters after processing by leveraging the trigger Trigger.AvailableNow.

Streaming latencies (top) and batch latencies (bottom)

Whenever you need to change the processing latency, change the trigger from AvailableNow to Continuous. The core business logic and code (e.g., applying windowing functions, aggregations, etc.) of how and what your process does not need to change, just your latency.

Core business logic and code does not need to change when you change latencies

Delta Lake completes exactly-once semantics

An important capability missing in all of this is that for structured streaming alone, the exactly-once semantics is insert only. Cleaning up data, GDPR deletes, and data reprocessing are among the many examples where we need more (e.g., updates, deletes, merges, etc.). These tasks require the duplication, modification, and deletion of files. For example, when reprocessing data, you must delete and create files in the correct order. Performing these tasks out of order will lead to partially written (i.e., corrupt) files or deleting the wrong files. And this is where Delta Lake completes the picture for exactly-once semantics.

Delta Lake provides ACID transaction guarantees between reads and writes. It brings reliability to your data lakehouse by providing serializable isolation levels, ensuring that readers never see inconsistent data.

Learn how Delta Lake makes Apache Spark better by bringing reliability to data lakehouses

Because Delta Lake provides these guarantees, your Structured Streaming jobs can safely perform DML operations such as DELETE, UPDATE, and MERGE.

Learn more about how Delete, Update, and Merge work in Delta Lake

Therefore, to complete the picture of exactly-once semantics, we add Delta Lake as your lakehouse storage format. This will provide reliability to both your data and metadata to guarantee its state.

Apache Spark Structured Streaming and Delta Lake provide complete exactly-once semantics

Discussion

Apache Spark™ Structured Streaming and Delta Lake together provide end-to-end exactly-once semantics. Using Structured Streaming allows you to decouple business logic (and its code) from latency. The reliability and guarantees that the systems together provide allow you (such as the preceding Comcast case) to reduce the number of compute required because there are fewer failures and fewer resources required to reprocess data.

Addendum

For more information on Apache Spark and Delta Lake:

Download O’Reilly | Learning Spark, 2nd edition, for free
Access the early release of O’Reilly | Delta Lake: The Definitive Guide

dennyglee

Denny Lee

Leave a ReplyCancel reply