How to transpose a DataFrame from columns to rows in Spark?

How to properly remove or decommission a node from an Elasticsearch cluster?

April 24, 2023

Getting started with Snowflake in less than 10 minutes

May 1, 2023

Published by Big Data In Real World at April 27, 2023

Tags

Unfortunately there is no built in function to transpose a DataFrame from columns to rows in Spark. In this post we will show an easy way to transpose a DataFrame from columns to rows in Spark.

Here is our data.

val data = Seq(
      ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
      ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
      ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
      ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000))

import spark.sqlContext.implicits._
val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()

+---------+-------+---------------+
|  Project|   Name|Cost_To_Project|
+---------+-------+---------------+
|Ingestion|  Jerry|           1000|
|Ingestion|   Arya|           2000|
|Ingestion|  Emily|           3000|
|       ML|  Riley|           9000|
|       ML|Patrick|           1000|
|       ML| Mickey|           8000|
|Analytics| Donald|           1000|
|Ingestion|   John|           1000|
|Analytics|  Emily|           8000|
|Analytics|   Arya|          10000|
|       BI| Mickey|          12000|
|       BI| Martin|           5000|
+---------+-------+---------------+

We want to transpose this data below. As you can see the columns are now turned into rows.

+---------------+---------+
|key            |value    |
+---------------+---------+
|Project        |Ingestion|
|Name           |Jerry    |
|Cost_To_Project|1000     |
|Project        |Ingestion|
|Name           |Arya     |
|Cost_To_Project|2000     |
|Project        |Ingestion|
|Name           |Emily    |
|Cost_To_Project|3000     |
|Project        |ML       |
|Name           |Riley    |
|Cost_To_Project|9000     |
|Project        |ML       |
|Name           |Patrick  |
|Cost_To_Project|1000     |
|Project        |ML       |
|Name           |Mickey   |
|Cost_To_Project|8000     |
|Project        |Analytics|
|Name           |Donald   |
+---------------+---------+

Create a map of all columns

For each row in the DataFrame we will create a map of each column in the row with column name as the key and value of the column as value.

var colAndVals = df.columns.flatMap { c => Array(lit(c), col(c)) }

scala> df.withColumn("myMap", map(colAndVals:_*)).show(false)
+---------+-------+---------------+---------------------------------------------------------------+
|Project  |Name   |Cost_To_Project|myMap                                                          |
+---------+-------+---------------+---------------------------------------------------------------+
|Ingestion|Jerry  |1000           |[Project -> Ingestion, Name -> Jerry, Cost_To_Project -> 1000] |
|Ingestion|Arya   |2000           |[Project -> Ingestion, Name -> Arya, Cost_To_Project -> 2000]  |
|Ingestion|Emily  |3000           |[Project -> Ingestion, Name -> Emily, Cost_To_Project -> 3000] |
|ML       |Riley  |9000           |[Project -> ML, Name -> Riley, Cost_To_Project -> 9000]        |
|ML       |Patrick|1000           |[Project -> ML, Name -> Patrick, Cost_To_Project -> 1000]      |
|ML       |Mickey |8000           |[Project -> ML, Name -> Mickey, Cost_To_Project -> 8000]       |
|Analytics|Donald |1000           |[Project -> Analytics, Name -> Donald, Cost_To_Project -> 1000]|
|Ingestion|John   |1000           |[Project -> Ingestion, Name -> John, Cost_To_Project -> 1000]  |
|Analytics|Emily  |8000           |[Project -> Analytics, Name -> Emily, Cost_To_Project -> 8000] |
|Analytics|Arya   |10000          |[Project -> Analytics, Name -> Arya, Cost_To_Project -> 10000] |
|BI       |Mickey |12000          |[Project -> BI, Name -> Mickey, Cost_To_Project -> 12000]      |
|BI       |Martin |5000           |[Project -> BI, Name -> Martin, Cost_To_Project -> 5000]       |
+---------+-------+---------------+---------------------------------------------------------------+

Explode the map

Explode the map once a map is created for each row.

df.withColumn("myMap", map(colAndVals:_*)).select(explode($"myMap")).show(false)

+---------------+---------+
|key            |value    |
+---------------+---------+
|Project        |Ingestion|
|Name           |Jerry    |
|Cost_To_Project|1000     |
|Project        |Ingestion|
|Name           |Arya     |
|Cost_To_Project|2000     |
|Project        |Ingestion|
|Name           |Emily    |
|Cost_To_Project|3000     |
|Project        |ML       |
|Name           |Riley    |
|Cost_To_Project|9000     |
|Project        |ML       |
|Name           |Patrick  |
|Cost_To_Project|1000     |
|Project        |ML       |
|Name           |Mickey   |
|Cost_To_Project|8000     |
|Project        |Analytics|
|Name           |Donald   |
+---------------+---------+

Full code

val data = Seq(
      ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
      ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
      ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
      ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000))

import spark.sqlContext.implicits._
val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()

var colAndVals = df.columns.flatMap { c => Array(lit(c), col(c)) }

df.withColumn("myMap", map(colAndVals:_*)).select(explode($"myMap")).toDF("Columns","Values").show(false)

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

1 Comment

Transposing a Spark DataFrame – Curated SQL says:

May 1, 2023 at 7:10 am

[…] The Big Data in Real World team makes a pivot: […]

How to transpose a DataFrame from columns to rows in Spark?

How to properly remove or decommission a node from an Elasticsearch cluster?

Getting started with Snowflake in less than 10 minutes

How to properly remove or decommission a node from an Elasticsearch cluster?

Getting started with Snowflake in less than 10 minutes

Create a map of all columns

Explode the map

Full code

Big Data In Real World

Related posts

Sunset: Spark Developer In Real World cluster

How to kill a running Spark application?

What is the default number of executors in Spark?

1 Comment