
How to subtract or see differences between two DataFrames in Spark?


Pretty simple. Use except() to subtract or find the difference between two DataFrames.


Solution

except() returns the rows that are in DataFrame 1 but not in DataFrame 2. It follows EXCEPT DISTINCT semantics: a row is removed from the result even if it appears multiple times in DataFrame 1 and only once in DataFrame 2, and the rows that remain are deduplicated.

import spark.implicits._ 

scala> val data1 = Seq(10, 20, 20, 30, 40)
data1: Seq[Int] = List(10, 20, 20, 30, 40)


scala> val data2 = Seq(20, 30)
data2: Seq[Int] = List(20, 30)


scala> val df1 = data1.toDF()
df1: org.apache.spark.sql.DataFrame = [value: int]


scala> val df2 = data2.toDF()
df2: org.apache.spark.sql.DataFrame = [value: int]


scala> df1.except(df2).show
+-----+
|value|
+-----+
|   40|
|   10|
+-----+

scala> df2.except(df1).show

+-----+
|value|
+-----+
+-----+
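Because except() deduplicates, it won't tell you how many times a row occurred. If you need duplicates preserved, Spark 2.4 and later also provide exceptAll(), which follows EXCEPT ALL semantics: each matching occurrence in DataFrame 2 removes only one occurrence from DataFrame 1. The snippet below is just a sketch continuing the same spark-shell session with the df1 and df2 defined above; the exact row ordering in the output may vary.

// exceptAll() keeps duplicates: 30 is removed entirely, but only one of the two 20s
scala> df1.exceptAll(df2).show
// expected rows (order may vary): 10, 20, 40

// To see the differences in both directions (rows present in only one of the two
// DataFrames), union the two except() results
scala> df1.except(df2).union(df2.except(df1)).show
// expected rows (order may vary): 10, 40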


 
