
Why do I see 200 tasks in Spark execution?


It is quite common to see 200 tasks in one of your stages, more specifically in a stage that requires a wide transformation. The reason is that wide transformations in Spark require a shuffle. Operations like join, groupBy, etc. are wide transformations, and they trigger a shuffle.

By default, Spark creates 200 partitions whenever a shuffle is needed. Each partition is processed by a task, so you end up with 200 tasks during execution.
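Here is a minimal sketch you can run in spark-shell to see this behavior. It assumes a SparkSession named spark (the default in spark-shell); the dataset and the column names df, key and counts are just illustrative.

import org.apache.spark.sql.functions.col

// Illustrative dataset: 1 million rows with a synthetic grouping key
val df = spark.range(0, 1000000).withColumn("key", col("id") % 100)

// groupBy is a wide transformation, so it triggers a shuffle
val counts = df.groupBy("key").count()

// The post-shuffle DataFrame has spark.sql.shuffle.partitions partitions
println(counts.rdd.getNumPartitions)  // prints 200 with default settings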


How to change the default 200 tasks?

The spark.sql.shuffle.partitions property controls the number of partitions created during a shuffle, and its default value is 200.

Change the value of spark.sql.shuffle.partitions to change the number of partitions used during a shuffle.

sqlContext.setConf("spark.sql.shuffle.partitions", "4")
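In Spark 2.x and later, where SparkSession is the entry point, the same setting can be applied through the session's conf (assuming spark is your SparkSession):

// Equivalent to the sqlContext call above
spark.conf.set("spark.sql.shuffle.partitions", "4")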

Should you change the default?

The short answer is: it depends.

200 partitions could be a lot when your data volume is small. Why? Because each partition is processed by a task, and when each task processes only a tiny amount of data, the scheduling and startup overhead of so many tasks can hurt performance. Quite simply, you can decrease the number of partitions, which results in fewer tasks and better performance.

If you are writing the data out after a shuffle and your data volume is small, you will end up with 200 small files by default. This is another reason you might want to change the default to a smaller number.
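Here is a sketch of both options, reusing the counts DataFrame from the earlier example; the output path /tmp/counts and the value 8 are just illustrative.

// Option 1: lower the shuffle partition count before the query runs
spark.conf.set("spark.sql.shuffle.partitions", "8")

// Option 2: coalesce an already-shuffled DataFrame before writing,
// so the output has 8 files instead of 200
counts.coalesce(8).write.mode("overwrite").parquet("/tmp/counts")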

200 partitions could be too low when the amount of data involved in the shuffle is huge. If a task processes too much data, you could see out of memory exceptions or slow task executions. This can be rectified by increasing the number of partitions, thereby increasing the number of tasks, which in turn lets each task process a manageable amount of data.
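A sketch of the increase, again reusing df from the earlier example; the value 1000 is an illustrative number, not a recommendation, and the right value depends on your data volume and cluster size.

// Increase shuffle partitions so each task handles a smaller slice of data
spark.conf.set("spark.sql.shuffle.partitions", "1000")

// The setting is read when the query executes, so this shuffle now
// produces 1000 tasks in the post-shuffle stage
println(df.groupBy("key").count().rdd.getNumPartitions)  // prints 1000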
