Introduction to Apache Spark

This article covers what you need to know about Spark and its ecosystem, with a detailed dive into the main concepts of Spark and its architecture.

What is Spark?

What is Spark used for?

Spark ecosystem:

Apache Spark ecosystem

The different components of Apache Spark are the following; a short usage sketch follows the list:

  • Apache Spark Core, which provides in-memory computing, and forms the basis of other components
  • Spark SQL, which provides structured and semi-structured data abstraction
  • Spark Streaming, which performs streaming analysis using RDD (Resilient Distributed Datasets) transformation
  • MLlib (Machine Learning Library), which is a distributed machine learning framework above Spark
  • GraphX, which is a distributed graph processing framework on top of Spark
  • BigDL, which is a distributed deep learning library for Apache Spark
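
As a quick illustration of how the first two components fit together, here is a minimal PySpark sketch that uses Spark Core (RDDs) and Spark SQL (DataFrames); the application name, data values, and column names are made up for this example:

    # Minimal sketch: Spark Core (RDDs) and Spark SQL (DataFrames) side by side.
    # The data values and column names are made up for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ecosystem-sketch").getOrCreate()

    # Spark Core: a low-level RDD transformation followed by an action.
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
    squares = rdd.map(lambda x: x * x).collect()      # [1, 4, 9, 16]

    # Spark SQL: the same data as a structured DataFrame, queried with SQL.
    df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["value"])
    df.createOrReplaceTempView("numbers")
    spark.sql("SELECT value * value AS square FROM numbers").show()

    spark.stop()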

The Apache Spark architecture principle:

Task parallelism and in-memory computing are the keys to Spark's speed.
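
The following minimal PySpark sketch illustrates both ideas; the number of partitions, the data, and the application name are arbitrary choices for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("in-memory-sketch").getOrCreate()
    sc = spark.sparkContext

    # Task parallelism: 8 partitions -> up to 8 tasks can run in parallel.
    data = sc.parallelize(range(1_000_000), numSlices=8)

    # In-memory computing: cache the transformed RDD so that the second
    # action reuses the in-memory result instead of recomputing it.
    evens = data.filter(lambda x: x % 2 == 0).cache()
    print(evens.count())   # first action: computes and caches
    print(evens.sum())     # second action: served from memory

    spark.stop()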

Apache Spark architecture

The run-time architecture of Spark consists of the following parts:

1. Spark Driver (Master Process)

The Spark Driver converts a program into tasks and schedules those tasks on the Executors. The Task Scheduler is part of the Driver and distributes the tasks to the Executors.
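
A minimal PySpark sketch of this flow (the application name and data are arbitrary): creating the SparkSession starts the driver, and only the action at the end makes the driver cut the plan into tasks for the task scheduler:

    from pyspark.sql import SparkSession

    # Creating a SparkSession starts the driver process for this application.
    spark = SparkSession.builder.appName("driver-sketch").getOrCreate()

    rdd = spark.sparkContext.parallelize(range(100), numSlices=4)

    # Transformations like map() only build up the plan on the driver;
    # the action count() makes the driver turn the plan into tasks and
    # hand them to the task scheduler, which sends them to the executors.
    total = rdd.map(lambda x: x + 1).count()
    print(total)

    spark.stop()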

2. Spark Cluster Manager

The cluster manager is the component of Spark that launches the executors, and in some deployments it launches the driver as well. The Spark Scheduler schedules the actions and jobs of a Spark application in FIFO order through the cluster manager.
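
A minimal PySpark sketch of pointing an application at a cluster manager via the master URL; the host name below is a placeholder, not a real cluster:

    from pyspark.sql import SparkSession

    # The master URL tells Spark which cluster manager to ask for executors.
    # "local[*]" runs everything in one JVM; "spark://host:7077" targets a
    # standalone cluster manager; "yarn" targets a Hadoop YARN cluster.
    spark = (SparkSession.builder
             .appName("cluster-manager-sketch")
             .master("local[*]")   # e.g. "spark://master-host:7077" or "yarn"
             .getOrCreate())

    spark.stop()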

3. Executors (Slave Processes)

Executors are the worker processes on which the individual tasks of a job run. Once launched, executors run for the whole lifecycle of a Spark application. A failed executor does not stop the execution of the Spark job.
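
A minimal PySpark sketch of requesting executor resources through configuration; the instance, core, and memory values are arbitrary examples, not recommendations:

    from pyspark.sql import SparkSession

    # Executor resources are requested through configuration before the
    # application starts; the values below are examples only.
    spark = (SparkSession.builder
             .appName("executor-sketch")
             .config("spark.executor.instances", "4")  # number of executors (YARN/standalone)
             .config("spark.executor.cores", "2")      # tasks each executor runs in parallel
             .config("spark.executor.memory", "2g")    # memory per executor
             .getOrCreate())

    spark.stop()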

4. RDD (Resilient Distributed Datasets)

An RDD is an immutable, distributed collection of data spread across the nodes of the cluster. Each RDD is divided into one or more partitions, and these partitions are the units of parallelism within the application. RDDs are at the core of Spark because their distribution across the nodes of the cluster leverages data locality. The repartition and coalesce transformations can be used to control the number of partitions, and data access is optimized through RDD shuffling: since Spark stays close to the data, it sends data across the nodes and creates the required partitions as needed.
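
A minimal PySpark sketch of working with partitions; the data size and partition counts are arbitrary:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-partitions-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1000), numSlices=4)
    print(rdd.getNumPartitions())       # 4 partitions -> 4 parallel tasks

    wider = rdd.repartition(8)          # full shuffle into 8 partitions
    narrower = wider.coalesce(2)        # merge down to 2 partitions, avoiding a full shuffle
    print(wider.getNumPartitions(), narrower.getNumPartitions())

    spark.stop()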

5. DAG (Directed Acyclic Graph)

Spark generates an operator graph from the code we enter in the Spark console. When an action is triggered on an RDD, Spark submits that graph to the DAGScheduler, which divides the operator graph into stages of tasks. Each stage may contain tasks based on the partitions of the incoming data. The DAGScheduler pipelines individual operators together; for instance, consecutive map operators are scheduled into a single stage. The stages are then passed to the Task Scheduler, which launches them through the cluster manager, and the executors (workers) execute these tasks on the slave nodes.
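
A minimal PySpark sketch that makes the DAG visible; the data and key function are arbitrary, and toDebugString prints the lineage that the DAGScheduler will turn into stages:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dag-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100), numSlices=4)

    # Two map operators pipeline into a single stage; reduceByKey forces a
    # shuffle and therefore a stage boundary in the DAG.
    pairs = rdd.map(lambda x: (x % 10, x)).mapValues(lambda v: v * v)
    sums = pairs.reduceByKey(lambda a, b: a + b)

    # The lineage (operator graph) Spark will hand to the DAGScheduler:
    print(sums.toDebugString().decode())

    # Only this action actually submits the DAG for execution.
    print(sums.count())

    spark.stop()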
