
How does Spark decide the number of tasks and number of tasks to execute in parallel?


In this post we will see how Spark decides the number of tasks in a job and how many of those tasks it executes in parallel.

Let’s see how Spark decides on the number of tasks with the below set of instructions.

  1. READ dataset_X
  2. FILTER on dataset_X
  3. MAP operation on dataset_X
  4. READ dataset_Y
  5. MAP operation on dataset_Y
  6. JOIN dataset_X and dataset_Y
  7. FILTER on joined dataset
  8. SAVE the output

Let’s also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.
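
To make the discussion concrete, here is a rough sketch of the same pipeline in the DataFrame API. The paths, column names and transformations below are made up purely for illustration; only the shape of the pipeline matters.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("task-count-demo").getOrCreate()

// 1-3. READ dataset_X, then FILTER and MAP
val datasetX = spark.read.parquet("/data/dataset_X")      // placeholder path
  .filter(col("amount") > 0)                              // FILTER on dataset_X
  .withColumn("amount_usd", col("amount") * 1.1)          // MAP-style transformation

// 4-5. READ dataset_Y, then MAP
val datasetY = spark.read.parquet("/data/dataset_Y")      // placeholder path
  .withColumn("key", upper(col("key")))                   // MAP-style transformation

// 6-8. JOIN, FILTER on the joined dataset, SAVE the output
val joined = datasetX.join(datasetY, Seq("key"))
  .filter(col("amount_usd") > 100)

joined.write.parquet("/data/output")                      // placeholder path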


Stages and number of tasks per stage

Spark will create 3 stages –

First stage – Instructions 1, 2 and 3

Second stage – Instructions 4 and 5

Third stage – Instructions 6, 7 and 8

Number of tasks in first stage 

The first stage reads dataset_X, which has 10 partitions. So stage 1 will result in 10 tasks.

If your dataset is very small, you might see that Spark still creates 2 tasks. This is because Spark looks at the defaultMinPartitions property, which sets the minimum number of partitions (and therefore tasks) Spark will create when reading the data. The default value of defaultMinPartitions is 2.
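
A quick way to observe this is to check the partition count Spark actually produces when reading a very small file from the Spark shell (where sc is the SparkContext); the file path here is just a placeholder.

// defaultMinPartitions is min(defaultParallelism, 2), so it is usually 2
println(sc.defaultMinPartitions)

// Even a tiny file ends up with at least defaultMinPartitions partitions
val tinyRdd = sc.textFile("/data/tiny_file.txt")   // placeholder path
println(tinyRdd.getNumPartitions)                  // typically prints 2

// You can ask for more partitions explicitly at read time
val moreParts = sc.textFile("/data/tiny_file.txt", 8)
println(moreParts.getNumPartitions)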

Number of tasks in second stage 

The second stage reads dataset_Y, which has 5 partitions. So stage 2 will result in 5 tasks.

Number of tasks in third stage 

The third stage executes a JOIN. A JOIN is a wide transformation, and wide transformations result in a shuffle. Spark’s optimizer tries to pick the “right” number of partitions during a shuffle, but most often you will see Spark create 200 tasks for stages executing wide transformations like JOIN, GROUP BY etc.

The spark.sql.shuffle.partitions property controls the number of partitions during a shuffle, and its default value is 200.

Change the value of spark.sql.shuffle.partitions to change the number of partitions created during a shuffle.

sqlContext.setConf("spark.sql.shuffle.partitions", "8")
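
In Spark 2.x and later the same setting is usually made through the SparkSession, and you can verify the effect by checking the partition count of the shuffled result (joined here refers to the join output from the sketch above; with adaptive query execution enabled, Spark 3.x may coalesce shuffle partitions further).

spark.conf.set("spark.sql.shuffle.partitions", "8")

// The join output now has 8 shuffle partitions instead of the default 200
println(joined.rdd.getNumPartitions)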

 

Number of tasks executed in parallel

The number of CPU cores available to your executors determines the number of tasks that can be executed in parallel for an application at any given time.

Let’s say you have 5 executors available for your application and each executor is assigned 10 CPU cores.

5 executors × 10 CPU cores per executor = 50 CPU cores available in total.

With the above setup, Spark can execute a maximum of 50 tasks in parallel at any given time.
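
As a hedged example, the setup above could be requested like this when building the session. spark.executor.instances and spark.executor.cores are standard Spark properties, but whether the requested executor count is honored depends on the cluster manager and on dynamic allocation.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-demo")
  .config("spark.executor.instances", "5")   // 5 executors
  .config("spark.executor.cores", "10")      // 10 CPU cores per executor
  .getOrCreate()

// 5 executors x 10 cores per executor = at most 50 tasks running at once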

 

