A common approach to developing data pipelines is to use Structured Streaming (aka Spark Structured Streaming) with Delta Lake. This lets you perform near-real-time data processing using the same APIs as batch processing, all on top of Delta Lake to ensure the transactional consistency of your data. With Project Lightspeed, Structured Streaming can achieve latencies below 250 ms, addressing use cases such as real-time monitoring and alerting.
So why would you use a system designed for near-real-time (and increasingly real-time) scenarios for batch processing? After all, common batch ETL processes have latencies in the minutes to hours typical of data warehousing scenarios.
Simpler and faster ETL with less compute
Before we dive into the technical details, let’s start with a cool use case. The Spark+AI Summit 2019 presentation, Building Sessionization Pipeline at Scale with Databricks Delta, had some interesting details.
- Structured Streaming provides exactly-once semantics (more on this later), ensuring the reliability of the data.
- By using Trigger.Once (now deprecated; use Trigger.AvailableNow instead), they could reduce their compute instances by an order of magnitude.
- This allowed the team to reduce from 84 batch jobs to 3 streaming jobs, halving their data latency from ~14 hours to ~7 hours.
- Delta Lake provides transactional protection, allowing Structured Streaming to perform DML operations. Therefore, their data pipelines could perform operations such as deletes, updates, and merges.
- The result was scalable data pipelines that reliably provided insight.
Their scenario did not require real-time or near-real-time processing. Yet with Structured Streaming and Delta Lake, they could reduce compute and simplify their data pipelines, all while improving data reliability.
Diving deeper into how this works
In the next section, we will discuss the internals of how this is possible. Initially, we will explore the concept of exactly-once guarantees (or semantics).
What are Exactly-Once guarantees?
Past systems provided different guarantees, such as at-least-once, which ensures no missing data but may produce duplicates. The desired state, however, is no missing data and no duplicates, i.e., exactly-once semantics.
For a system to provide exactly-once semantics:
- It will track the files that are currently processing (black box), failing (red x), or needing reprocessing (red box).
- It will track which files completed successfully (yellow box).
- It will track which files are new and need processing (blue box).
To perform this task, the processing engine will need to store a lot of metadata to keep track of all of these files and their state.
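To make that bookkeeping concrete, here is a toy sketch in plain Python (this is not Spark's actual implementation; the function name and checkpoint file are illustrative). A checkpoint records which input files completed successfully, so re-running the job skips them (no duplicates) and picks up anything new (no missing data):

```python
import json
import os
import tempfile

def process_new_files(all_files, checkpoint_path):
    """Process only files not yet recorded in the checkpoint,
    then record them as completed (exactly-once behavior)."""
    # Load the set of files already processed successfully.
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))

    processed = []
    for path in sorted(all_files):
        if path in done:
            continue            # already completed: skip (no duplicates)
        processed.append(path)  # "process" the file here
        done.add(path)

    # Persist updated state atomically: write to a temp file, then
    # rename, so a crash never leaves a half-written checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(checkpoint_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, checkpoint_path)
    return processed
```

Running the function twice against an overlapping file listing processes each file exactly once, even though the second run sees the old files again.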
What does a stream processing engine do?
Therefore, to have a stream processing engine provide exactly-once semantics, it must track all of these files and their current state.
Learn more about the details via the Structured Streaming Programming Guide > Programming Model. The key tenet is that structured streaming tracks files and stores metadata that is fault-tolerant and ensures these files are processed exactly-once.
What is the difference between streaming and batch processing?
Within this context of structured streaming, what is the difference between stream and batch processing?
Simply put, the key difference is latency.
As noted earlier, regardless of latency, to provide exactly-once semantics the system must track the files and store their state in a fault-tolerant manner. With Apache Spark, you get this out of the box for both streaming and batch, without needing to reason about exactly-once semantics.
"The simplest way to perform streaming analytics is not having to reason about streaming at all." – Tathagata Das, Apache Spark 2.0: A Deep Dive Into Structured Streaming
For more information, watch Tathagata Das present Apache Spark 2.0: A Deep Dive Into Structured Streaming at Spark Summit 2016.
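To illustrate not having to reason about streaming, the sketch below (which assumes a running Spark session with Delta Lake configured; paths and column names are hypothetical) applies one transformation function unchanged to both a batch read and a streaming read:

```python
# A hedged sketch: requires a Spark runtime with Delta Lake configured;
# the table path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

def count_by_device(df):
    # Identical business logic for batch and streaming inputs.
    return df.groupBy("device_id").agg(F.count("*").alias("events"))

# Same function, two input modes: only the read API differs.
batch_counts  = count_by_device(spark.read.format("delta").load("/data/events"))
stream_counts = count_by_device(spark.readStream.format("delta").load("/data/events"))
```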
Decoupling Business Logic from Latency
Both streaming and batch processing have the same set of metadata tracking problems; they are just performed at different latencies. For streaming scenarios, your clusters must stay online to continuously process incoming data (top). But for batch processing, you can shut down your clusters after processing by leveraging the Trigger.AvailableNow trigger. Whenever you need to change the processing latency, change the trigger, e.g., from Trigger.AvailableNow to Trigger.ProcessingTime or Trigger.Continuous. The core business logic and code (e.g., applying windowing functions, aggregations, etc.) of how and what you process does not need to change, just your latency.
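As a sketch of that trigger swap (it assumes a Spark 3.3+ runtime with Delta Lake; all paths are illustrative placeholders), the same query definition runs batch-like or continuously depending on a single line:

```python
# A hedged sketch: requires a Spark 3.3+ runtime with Delta Lake;
# the input/output/checkpoint paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream.format("delta")
          .load("/data/events")
          .filter("event_type = 'click'"))   # same logic at any latency

query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/clicks")
         # Batch-like: process everything available, then stop the cluster.
         .trigger(availableNow=True)
         # To lower latency, change only the trigger, e.g.:
         #   .trigger(processingTime="1 minute")
         #   .trigger(continuous="1 second")   # map-only queries
         .start("/data/clicks"))
```

With `availableNow=True`, the query drains all pending input and terminates, so the cluster can be shut down until the next scheduled run; the checkpoint guarantees the next run resumes exactly where this one stopped.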
Delta Lake completes exactly-once semantics
An important gap in all of this is that Structured Streaming alone provides exactly-once semantics for inserts only. Cleaning up data, GDPR deletes, and data reprocessing are among the many examples where we need more (e.g., updates, deletes, merges). These tasks require the duplication, modification, and deletion of files. For example, when reprocessing data, you must delete and create files in the correct order; performing these tasks out of order leads to partially written (i.e., corrupt) files or to deleting the wrong files. This is where Delta Lake completes the picture for exactly-once semantics.
Delta Lake provides ACID transaction guarantees between reads and writes. It brings reliability to your data lakehouse by providing serializable isolation levels, ensuring that readers never see inconsistent data.
Because Delta Lake provides these guarantees, your Structured Streaming jobs can safely perform DML operations such as UPDATE, DELETE, and MERGE.
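For example, a streaming upsert can be sketched with `foreachBatch` and a Delta MERGE (this assumes a Spark runtime with the delta-spark package; the table paths, join key, and function name are illustrative):

```python
# A hedged sketch: requires Spark with the delta-spark package installed;
# table paths, column names, and the join key are illustrative.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def upsert_microbatch(batch_df, batch_id):
    # Delta's ACID guarantees make this MERGE safe to retry if a
    # micro-batch fails partway: re-merging the same rows yields
    # the same result (an idempotent upsert).
    (DeltaTable.forPath(spark, "/data/target").alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta").load("/data/source")
    .writeStream
    .foreachBatch(upsert_microbatch)
    .option("checkpointLocation", "/chk/upsert")
    .trigger(availableNow=True)
    .start())
```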
Therefore, to complete the picture of exactly-once semantics, we add Delta Lake as your lakehouse storage format. This will provide reliability to both your data and metadata to guarantee its state.
Apache Spark™ Structured Streaming and Delta Lake together provide end-to-end exactly-once semantics. Structured Streaming lets you decouple business logic (and its code) from latency. The reliability and guarantees the two systems provide together allow you (as in the preceding Comcast case) to reduce the amount of compute required, because there are fewer failures and fewer resources are needed to reprocess data.
For more information on Apache Spark and Delta Lake: