[Data Engineering] ETL Batch Pipeline and Streaming Pipeline

Jacky Fu
6 min read · Jan 31, 2023



ETL (Extract, Transform, Load) is one of the essential skills for a data engineer. But depending on the scenario, when should data be processed in batches, and when should it be processed in real time?

The most common answer is that it depends on the acceptable data latency for the application using the data.

Batch Processing

Batch processing ensures that data is updated correctly, but tolerates some data latency (often measured in hours, days, or weeks). For example, a postal package-tracking system might only refresh its delivery status every ten minutes.

Common features include:

  1. Specific and repetitive execution conditions, such as a clear time point (8 am every Monday), time interval (every hour), or data volume (every 100 MB).
  2. Situations where data must be collected in full before it can be processed, such as sorting all sales data for a business report.
  3. No urgent need for the latest or real-time data, but a high demand for data accuracy.

Common business scenarios (a minimal batch ETL sketch follows this list):

  1. Regular data backup
  2. Mid- to long-term trend prediction
  3. Information and tables that rarely change
  4. Analysis of historical transactions and reports
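
To make the batch pattern concrete, here is a minimal sketch in plain Python of a daily batch ETL job: extract a full day's sales file, aggregate it, and load a report. The file names and column names (product_id, amount) are hypothetical placeholders, not a real system.

```python
import csv
from collections import defaultdict
from datetime import date

def run_daily_batch(input_path: str, output_path: str) -> None:
    # Extract: read the full day's records. A batch job assumes the
    # data set is complete before processing starts.
    totals = defaultdict(float)
    with open(input_path, newline="") as f:
        for row in csv.DictReader(f):
            # Transform: aggregate revenue per product.
            totals[row["product_id"]] += float(row["amount"])

    # Load: write the aggregated report for downstream consumers.
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product_id", "total_amount"])
        for product_id, amount in sorted(totals.items()):
            writer.writerow([product_id, amount])

if __name__ == "__main__":
    today = date.today().isoformat()
    run_daily_batch(f"sales_{today}.csv", f"report_{today}.csv")
```

A scheduler such as cron or Airflow would then trigger the script on one of the fixed cadences described above, e.g. a crontab entry like `0 8 * * 1 python daily_batch.py` for 8 am every Monday.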

Streaming

Data needs to be processed and used in real time, with little to no latency (usually measured in milliseconds up to about a minute). For example, financial institutions want to notify consumers immediately about potential fraud or credit card theft.

Common characteristics include:

  • Continuous event triggers that are processed and responded to immediately.
  • Unpredictable, irregular data volumes and trigger frequencies from users.
  • Recording the time each trigger is executed.
  • The use of message queues to buffer bursts of incoming data.

Common business scenarios include (a streaming sketch follows this list):

  • Processing of audio and streaming media data
  • Fraud alerts and suspicious card transaction notifications
  • Real-time advertising or recommendations based on user behavior
  • High-frequency stock trading in finance
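
The characteristics above can be sketched with nothing but the standard library. Below, a queue.Queue stands in for a real message queue (Kafka, RabbitMQ, etc.): a producer fires events at unpredictable intervals, each event records the time it was triggered, and a consumer processes each event as soon as it arrives. All names and timings are illustrative.

```python
import queue
import random
import threading
import time
from datetime import datetime, timezone

events = queue.Queue()  # stands in for a message queue, buffering bursts

def producer() -> None:
    # Triggers arrive continuously, at unpredictable, irregular intervals.
    for i in range(10):
        # Each trigger records the time it fired.
        events.put({"event_id": i,
                    "event_time": datetime.now(timezone.utc).isoformat()})
        time.sleep(random.uniform(0.01, 0.5))
    events.put(None)  # sentinel: no more events

def consumer() -> None:
    # Each event is processed and responded to as soon as it arrives.
    while (event := events.get()) is not None:
        print(f"processed event {event['event_id']} "
              f"(triggered at {event['event_time']})")

threading.Thread(target=producer, daemon=True).start()
consumer()
```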

Is it possible to have both?

Is it possible to have both high accuracy and low latency simultaneously when dealing with both batch and streaming data?

Yes, it is possible to achieve both high accuracy and low latency in data processing. The Lambda Architecture, proposed by Nathan Marz, the creator of Storm, was designed to address exactly this challenge. Its core idea is to duplicate the pipeline into two branches: a streaming branch that processes data with low latency and writes its results to the Speed Layer, and a batch branch that processes the same data and writes its results to the Batch Layer. The two sets of results are then combined in the Serving Layer. Wherever Batch Layer results are available, they overwrite the Speed Layer results, ensuring high accuracy even if errors occurred during streaming.
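
As a toy illustration of the Serving Layer's merge rule (a sketch of the idea, not Marz's actual implementation), the snippet below keys both views by entity id and lets Batch Layer results overwrite Speed Layer results wherever they exist:

```python
def serve(batch_view: dict, speed_view: dict) -> dict:
    # Start from the low-latency Speed Layer results...
    merged = dict(speed_view)
    # ...then let Batch Layer results overwrite them wherever the
    # slower but more accurate batch computation has caught up.
    merged.update(batch_view)
    return merged

speed_view = {"user_1": 102, "user_2": 57}  # approximate real-time counts
batch_view = {"user_1": 100}                # authoritative recomputed counts
print(serve(batch_view, speed_view))        # {'user_1': 100, 'user_2': 57}
```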

Consider a scenario:

A batch job was originally executed once a day. We adjust it to run every hour, then every minute, and finally every second. Under such extreme conditions, can we claim that this pipeline is a Streaming service? What's the difference?

This raises the concept of Micro-Batch, which is different from Streaming. Here is my perspective. First, we need to clarify what kind of service this is. If it is, say, a business report for last quarter, the result will be the same no matter how many times it is recomputed, so rerunning it constantly is just a waste of computational resources. If it is instead a video recommendation system based on user behavior, I would ask why Streaming wasn't used in the first place; perhaps limited computational resources, special business considerations, or other bottlenecks in the data flow only allow Batch to be used.

To summarize, Micro-Batch and Streaming are different architectures. Micro-Batch is a compromise between Streaming and Batch processing, in which data is processed in small batches over short time windows; Spark Streaming is a well-known example. Kafka Streams, by contrast, is generally considered a true Streaming architecture, as it processes data record by record in real time. The choice between them depends on the service's tolerance for data delay, its practical requirements, and the underlying infrastructure. The batch size in Micro-Batch processing balances latency against throughput, so the choice of architecture should be grounded in an understanding of the actual scenario.
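
In Spark Structured Streaming (assuming pyspark is installed), the micro-batch interval is an explicit trigger setting, which makes the latency/throughput trade-off visible. This sketch uses Spark's built-in "rate" test source and prints each micro-batch to the console:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# "rate" is Spark's built-in test source: it emits rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (
    stream.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")  # one micro-batch every 10 s
    .start()
)
query.awaitTermination()  # runs until interrupted
```

Shrinking the trigger interval lowers latency but raises per-batch overhead; that trade-off is exactly the difference the question above is probing.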

Streaming pipelines also have an acknowledgment mechanism: when a task completes, the pipeline notifies the producer. If the producer does not receive an acknowledgment within a certain amount of time, it resends the same message, which is the At-Least-Once guarantee. This can result in duplicated data; one way to approach Exactly-Once semantics is to use a Micro-Batch architecture.
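
A common complement to At-Least-Once delivery, whether the pipeline is Micro-Batch or record-at-a-time, is an idempotent consumer that remembers which message ids it has already processed. The in-memory set below is illustrative only; a real pipeline would persist the ids (e.g., in the sink database) so deduplication survives restarts:

```python
processed_ids: set[str] = set()  # illustrative; persist this in production

def handle(message: dict) -> None:
    msg_id = message["id"]
    if msg_id in processed_ids:
        return  # duplicate redelivery under At-Least-Once: safe to skip
    # ... apply the message's side effect here, exactly once ...
    processed_ids.add(msg_id)

handle({"id": "a1", "payload": 1})
handle({"id": "a1", "payload": 1})  # redelivered duplicate, ignored
print(len(processed_ids))  # -> 1
```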

As a data engineer on a team

What aspects may influence the choice between a Batch and a Streaming solution during discussions for a new product?

Efficiency & Feasibility

Without a clear requirement for data latency, when Batch and Streaming are both feasible options, it's wise to focus on implementing the data pipeline logic first. Batch processing is usually the practical choice at the early stage of product development, when there are few users. Streaming requires maintaining more services, such as a message queue, and effort is easily consumed by tracking data or avoiding duplicated messages.

Cost

Many cloud services offer pay-as-you-go pricing, meaning the service is activated only when computation is needed and charges are based on usage. In this model, batch processing is often more economical than a streaming pipeline that must be on standby at all times. It's still important to discuss the overall budget with the architecture team and leadership, as the budget often directly determines the solution. (After all, cost is usually the company's first consideration.)

Application Scenario

Take Netflix's recommendation system as an example. Based on a user's historical behavior, sufficient recommendations can be calculated and updated in Batch mode, and users usually won't notice. However, Netflix offers a family plan, and some users even privately lend their accounts to friends (don't do that), which calls for a Streaming pipeline that can adapt recommendations in real time to the preferences of whoever is currently watching.

Conclusion

Through this post, I want to deepen my understanding of the choice between Batch and Streaming. As the Netflix example above shows, even the same recommendation system may adjust how its pipeline is built because of user habits, usage scenarios, and product positioning. Communicating with team members is often the best remedy when exploring the various solutions!


If you find this article helpful, please give me some applause. Hold your clap for 10 seconds; each person can clap up to 50 times! I'll keep publishing articles, mainly about data analysis, data engineering notes, and some personal thoughts. Friends who are interested are welcome to follow and share.
