This article covers all you need to know about Spark and its ecosystem, with a detailed dive into the main concepts of Spark and its architecture.
What is Spark?
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
What is Spark used for?
Apache Spark is used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing.
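To see why in-memory caching matters for query speed, here is a toy illustration in plain Python (not Spark itself): a dataset held in memory can serve many queries, while an uncached dataset must be recomputed for every one.

```python
def expensive_load():
    """Stand-in for reading and transforming a large dataset."""
    return [x * x for x in range(1_000_000)]

def query_uncached(n_queries):
    """Without caching: every query recomputes the dataset from scratch."""
    return [sum(expensive_load()[:100]) for _ in range(n_queries)]

def query_cached(n_queries):
    """With caching (what Spark's .cache()/.persist() enables):
    the dataset is materialized in memory once and reused by every query."""
    cached = expensive_load()          # computed once, kept in memory
    return [sum(cached[:100]) for _ in range(n_queries)]

# Both strategies return the same answers; the cached one does the
# expensive work only once instead of once per query.
assert query_uncached(3) == query_cached(3)
```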
The Spark ecosystem:
Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing and machine learning, with an easy-to-use API. To code in Spark, you can choose among several programming languages, including Java, Scala, Python, R and SQL.
The different components of Apache Spark are:
- Apache Spark Core, which provides in-memory computing, and forms the basis of other components
- Spark SQL, which provides structured and semi-structured data abstraction
- Spark Streaming, which performs streaming analysis using RDD (Resilient Distributed Datasets) transformations
- MLlib (Machine Learning Library), which is a distributed machine learning framework above Spark
- GraphX, which is a distributed graph processing framework on top of Spark
- BigDL, which is a distributed deep learning library for Apache Spark
The Apache Spark architecture principle:
Apache Spark stores data in RDDs (Resilient Distributed Datasets), immutable distributed collections of objects, and divides each dataset into logical partitions so that each part can be processed in parallel on different nodes of the cluster.
Task parallelism and in-memory computing are the keys to Spark's speed.
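The idea of partitioned, parallel processing can be sketched in plain Python (a toy model, not Spark's implementation): the data is split into partitions, and each partition is processed independently, with a thread pool standing in for the nodes of the cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a collection into roughly equal logical partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(part):
    """Work applied to one partition, e.g. a map plus local aggregation."""
    return sum(x * 2 for x in part)

data = list(range(10))
parts = partition(data, 3)

# Each partition is handled in parallel, as if on a different node.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(process_partition, parts))

total = sum(partial_results)              # combine per-partition results
assert total == sum(x * 2 for x in data)  # same answer as a serial pass
```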
The runtime architecture of Spark consists of these parts:
1. Spark Driver (Master Process)
The Spark Driver converts a user program into tasks and schedules those tasks on executors. The task scheduler is part of the driver and distributes tasks to the executors.
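A highly simplified model of that division of labour, in plain Python rather than Spark's actual scheduler: the driver turns a job into one task per partition, and a task scheduler hands the tasks out across the available executors.

```python
def build_tasks(job_fn, partitions):
    """Driver side: create one task per partition of the input data."""
    return [(job_fn, part) for part in partitions]

def assign_round_robin(tasks, executor_ids):
    """Task scheduler: distribute tasks across the available executors."""
    assignment = {eid: [] for eid in executor_ids}
    for i, task in enumerate(tasks):
        assignment[executor_ids[i % len(executor_ids)]].append(task)
    return assignment

partitions = [[1, 2], [3, 4], [5, 6]]
tasks = build_tasks(sum, partitions)
plan = assign_round_robin(tasks, ["executor-1", "executor-2"])

# executor-1 receives the tasks for partitions 0 and 2,
# executor-2 receives the task for partition 1.
assert len(plan["executor-1"]) == 2 and len(plan["executor-2"]) == 1
```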
2. Spark Cluster Manager
The cluster manager is the component of Spark responsible for launching executors, and in some deployment modes it launches the driver as well. The Spark scheduler submits the actions and jobs of a Spark application to the cluster manager in FIFO order.
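FIFO scheduling itself is simple to picture; a minimal plain-Python sketch (a toy model, not Spark's scheduler) is just a first-in, first-out queue of jobs:

```python
from collections import deque

# Toy FIFO job queue, mimicking Spark's default FIFO scheduling mode:
# jobs run in the order the application submitted them.
job_queue = deque()
for name in ["job-1", "job-2", "job-3"]:
    job_queue.append(name)   # submission order is preserved

executed = []
while job_queue:
    executed.append(job_queue.popleft())  # earliest-submitted job runs first

assert executed == ["job-1", "job-2", "job-3"]
```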
3. Executors (Slave Processes)
Executors are the worker processes on which the individual tasks of a job run. Once launched, executors run for the lifetime of the Spark application. A failed executor does not stop the execution of a Spark job: its tasks are resubmitted to other executors.
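That fault-tolerance property can be sketched in plain Python (a toy model of task resubmission, not Spark's recovery mechanism): when a task fails on one executor, it is simply retried on another, so the job still completes.

```python
def run_job(tasks, executors, max_retries=3):
    """Run each task, resubmitting it elsewhere if an executor fails."""
    results = []
    for task in tasks:
        for attempt in range(max_retries):
            executor = executors[attempt % len(executors)]
            try:
                results.append(executor(task))
                break
            except RuntimeError:
                continue  # failed executor: try the task on another one
        else:
            raise RuntimeError("task failed on all executors")
    return results

healthy = lambda part: sum(part)
def flaky(part):
    raise RuntimeError("executor lost")

# Even with one failed executor, the job still completes.
out = run_job([[1, 2], [3, 4]], [flaky, healthy])
assert out == [3, 7]
```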
4. RDD (Resilient Distributed Datasets)
An RDD is an immutable distributed collection of records partitioned across the nodes of the cluster. Each RDD is divided into one or more partitions, and those partitions are the units of parallelism inside the application. The repartition and coalesce transformations can be used to change the number of partitions. RDDs are at the core of Spark: because their partitions are distributed across the nodes of the cluster, Spark can leverage data locality, sending computation to where the data lives and shuffling data between nodes only when required.
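The effect of repartition and coalesce on partition counts can be sketched in plain Python (a toy model; in real Spark, repartition triggers a full distributed shuffle, while coalesce merges existing partitions to avoid one):

```python
def repartition(partitions, n):
    """Redistribute all records evenly across n partitions (full shuffle)."""
    records = [r for part in partitions for r in part]
    new_parts = [[] for _ in range(n)]
    for i, record in enumerate(records):
        new_parts[i % n].append(record)   # round-robin redistribution
    return new_parts

def coalesce(partitions, n):
    """Reduce to n partitions by merging existing ones (no full shuffle)."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)        # whole partitions are combined
    return merged

parts = [[1], [2, 3], [4, 5, 6], [7]]
assert len(repartition(parts, 2)) == 2
assert len(coalesce(parts, 2)) == 2
# No records are lost either way.
assert sorted(sum(repartition(parts, 2), [])) == [1, 2, 3, 4, 5, 6, 7]
assert sorted(sum(coalesce(parts, 2), [])) == [1, 2, 3, 4, 5, 6, 7]
```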
5. DAG (Directed Acyclic Graph)
Spark builds an operator graph from the code we submit to the Spark console. When an action is triggered on an RDD, Spark submits that graph to the DAGScheduler, which divides it into stages of tasks. A stage may contain tasks operating on several partitions of the incoming data. The DAGScheduler pipelines individual operators together where possible; for instance, consecutive map operators are scheduled into a single stage. The stages are then handed to the task scheduler, which launches the tasks via the cluster manager. It is the job of the workers, the executors, to execute these tasks on the slave nodes.
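The stage-building idea can be sketched as follows (a toy model in plain Python, not the real DAGScheduler): consecutive narrow operators such as map and filter are pipelined into one stage, and each wide, shuffle-requiring operator closes a stage boundary.

```python
NARROW = {"map", "filter"}             # no data movement between partitions
WIDE = {"groupByKey", "reduceByKey"}   # require a shuffle: stage boundary

def build_stages(operators):
    """Group a linear chain of operators into pipelined stages."""
    stages, current = [], []
    for op in operators:
        current.append(op)
        if op in WIDE:          # a shuffle ends the current stage
            stages.append(current)
            current = []
    if current:                 # trailing narrow operators form a last stage
        stages.append(current)
    return stages

ops = ["map", "filter", "reduceByKey", "map", "groupByKey", "map"]
stages = build_stages(ops)
assert stages == [["map", "filter", "reduceByKey"],
                  ["map", "groupByKey"],
                  ["map"]]
```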