What is Apache Spark in Big Data? Apache Spark is an open-source cluster computing framework developed by AMPLab. Unlike Hadoop’s two-stage disk-based MapR paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. Now that we have refreshed the definition of Apache Spark, let us get back to our topic of the day – “Streaming in Apache Spark.” Here we will talk about the streaming components in Spark, main components in streaming application and finally the main steps in streaming applications. Let’s get started.
Spark Streaming Components – Data Model
When we talk about Spark, it is important to understand that all the data is modelled as Resilient Distributed Datasets. These are built by design with pedigree of deterministic operations. This means that no matter how many times you compute the data it will always lead you to the same result. It is almost similar to Hadoop’s fault-tolerance for slave failures with an exception of the mechanism used.
The Resilient Distributed Datasets or RDDs are immutable, deterministically re-computable and distributed datasets in Spark. A DStream on the other hand is an abstraction used in Spark streaming over RDDs. DStream gets its name from being a stream of RDDs. Majority of the APIs applies over DStreams.
Another important component of Spark Streaming is the nodes. There are basically two types of nodes –
- Worker Node: Worker nodes are slave nodes which run the application code on the cluster.
- Driver Node: Driver nodes are the main program of the application. They are quite similar to the application master in Hadoop YARN environment. Driver nodes own the Spark context; thus, the entire state of application.
Main Components in Streaming Applications
Here is a brief about various components used in streaming applications –
- Driver: This component is similar to the master node in a Storm application from a conceptual viewpoint.
- Receiver: This component which resides inside a worker node is same as Spout in Apache Storm. The Receiver consumes the data from source. Besides, there are OOTB (built-in receivers) for the common ones.
- Executor: The Executor is used to process data in a similar fashion as Bolt does in Apache Storm.
Having discussed the components in the Spark Streaming, now let us understand the main steps which are involved in a Spark Streaming Application.
Main Steps in Streaming Applications
There are three basic steps involved in a streaming application; hence, it is very important to understand the record processing guarantees at each step. These record processing guarantees can semantically happen at least once, at most once or exactly once.
- Receiving the Streaming Data – It depends on the input source available at this step which determines the usage of receivers (reliable vs. Unreliable). To cite an example, a stream from Kafka is reliable but direct data from socket connections is unreliable.
It is a default function to replicate data in memory to two worker nodes when it is received in from any receiver in a Spark streaming. Once this happens and the receiver happens to be reliable, an acknowledgement is sent; otherwise, if the receiver was unreliable, the data is lost. This means at least one of these scenarios will happen.
In case the Driver node fails, it causes in the loss of Spark context and hence, all the previous data. One of the remedial measures in such cases is mechanism of a Spark WAL (write ahead logs) but in a cleaner manner and also, if the data sender allows, the other option is to reuse and consume their WAL instead.
- Transform the Data – This is the stage where we have a guarantee of exactly-once semantics. This happens because of the underlying RDD guarantees. In a situation where the worker node fails, the computation of the transformation gets done on other node where the data is replicated.
- Output the Transformed Data – Output operations have at-least-once semantics. This means that the data which is transformed might get written to an external entity more than once in a worker failure scenario. Also, it needs to be understood that it may require extra efforts to achieve exactly-once semantics. There can be two approaches for doing so –
- Idempotent Updates – Several attempts write the same data always.
- Transactional Updates – In order to make the updates exactly once, the updates are made atomically.
Let us understand this with the help of an example. Assuming there is a batch of events and in this batch, one of the operations is to maintain “global count” in a way that it keeps a counter of total events streamed till now. Now consider this – What if the node that was processing fails mid-way while the batch of events is being processed?
Is the global count reflecting the “half way events” processed? If we only consider the global count, then there is built-in global counter available in Apache Spark which has the capability to handle this issue. But since this is a hypothetical situation, for all other situations besides counter, the lineage of transformation applied on the whole batch of data will act as a complete remedy. As discussed earlier, RDD transformations are deterministically re-computable which means that the re-computation will give similar results. However, if you wish to store the result externally then that logic has to be independently handled.
There is a lot more to Apache Spark which you can learn if you wish to enter into the field of Data Analytics. Collabera TACT has a great training program on Apache Spark and Storm. If Big Data excites you or you are already working as a Hadoop professional, then this is one framework which will give your career the required boost.