
How does Spark decide stages and tasks during execution of a Job?


Spark breaks a job into stages at shuffle boundaries and runs each stage as a set of tasks. Let’s see this with an example. Here is the series of instructions in our Spark code; let’s see how Spark decides on stages and tasks for them (a code sketch of the same pipeline follows the list).

  1. READ dataset_X
  2. FILTER on dataset_X
  3. MAP operation on dataset_X
  4. READ dataset_Y
  5. MAP operation on dataset_Y
  6. JOIN dataset_X and dataset_Y
  7. FILTER on joined dataset
  8. SAVE the output
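
Here is a minimal sketch of what this pipeline might look like using the RDD API. The file paths, the comma-separated record format and the key extraction are assumptions made purely for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageDemo"))

    // 1. READ dataset_X, then 2. FILTER and 3. MAP (narrow transformations)
    val datasetX = sc.textFile("hdfs:///data/dataset_X")   // hypothetical path
      .filter(line => line.nonEmpty)                       // 2. FILTER
      .map(line => (line.split(",")(0), line))             // 3. MAP to (key, record)

    // 4. READ dataset_Y, then 5. MAP (narrow transformation)
    val datasetY = sc.textFile("hdfs:///data/dataset_Y")   // hypothetical path
      .map(line => (line.split(",")(0), line))             // 5. MAP to (key, record)

    // 6. JOIN - a wide transformation that shuffles both datasets
    val joined = datasetX.join(datasetY)

    // 7. FILTER on the joined dataset (narrow, stays in the join's stage)
    val result = joined.filter { case (key, _) => key.nonEmpty }

    // 8. SAVE - the action that actually triggers the job
    result.saveAsTextFile("hdfs:///data/output")           // hypothetical path

    sc.stop()
  }
}
```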


Stages

Spark will create a stage for each dataset that is read.

All consecutive narrow transformations (e.g. FILTER, MAP) will be grouped together inside the same stage.

Spark will create a new stage when it encounters a wide transformation (e.g. JOIN, reduceByKey).

For the above set of instructions, Spark will create 3 stages –

First stage – Instructions 1, 2 and 3

Second stage – Instructions 4 and 5

Third stage – Instructions 6, 7 and 8
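
You can verify this stage breakdown yourself: calling toDebugString on the final RDD prints its lineage, and each shuffle in that lineage marks a stage boundary. Continuing the sketch above:

```scala
// Continuing the sketch above: print the lineage of the final RDD before
// the action runs. Each shuffle dependency in the printed lineage marks a
// stage boundary; the exact formatting varies by Spark version.
println(result.toDebugString)
```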

Tasks

Spark creates a task to execute a set of instructions inside a stage.

The number of tasks in a stage equals the number of partitions in the dataset it processes.
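
Since the task count tracks the partition count, you can inspect it directly with getNumPartitions. The snippet below continues the sketch above; the paths and the partition count of 8 are assumptions for illustration:

```scala
// Continuing the sketch above (sc is the SparkContext).
val pairsX = sc.textFile("hdfs:///data/dataset_X")   // hypothetical path
  .map(line => (line.split(",")(0), line))
val pairsY = sc.textFile("hdfs:///data/dataset_Y")   // hypothetical path
  .map(line => (line.split(",")(0), line))

// Tasks in each scan stage = partitions in the input
// (for HDFS, roughly one partition per block)
println(pairsX.getNumPartitions)
println(pairsY.getNumPartitions)

// For the stage after a shuffle, the shuffle's partition count decides
// the number of tasks; here we ask for 8 explicitly.
val joined = pairsX.join(pairsY, 8)
println(joined.getNumPartitions)   // 8 -> the join stage runs 8 tasks
```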

A task executes all the consecutive narrow transformations inside a stage; this is called pipelining.

Tasks in the first stage will execute instructions 1, 2 and 3.

Tasks in the second stage will execute instructions 4 and 5.

Tasks in the third stage will execute instructions 6, 7 and 8.
