How to properly remove or decommission a node from an Elasticsearch cluster?
April 24, 2023Getting started with Snowflake in less than 10 minutes
May 1, 2023Unfortunately there is no built in function to transpose a DataFrame from columns to rows in Spark. In this post we will show an easy way to transpose a DataFrame from columns to rows in Spark.
Here is our data.
val data = Seq(
("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000))
import spark.sqlContext.implicits._
val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()
+---------+-------+---------------+
| Project| Name|Cost_To_Project|
+---------+-------+---------------+
|Ingestion| Jerry| 1000|
|Ingestion| Arya| 2000|
|Ingestion| Emily| 3000|
| ML| Riley| 9000|
| ML|Patrick| 1000|
| ML| Mickey| 8000|
|Analytics| Donald| 1000|
|Ingestion| John| 1000|
|Analytics| Emily| 8000|
|Analytics| Arya| 10000|
| BI| Mickey| 12000|
| BI| Martin| 5000|
+---------+-------+---------------+
We want to transpose this data below. As you can see the columns are now turned into rows.
+---------------+---------+ |key |value | +---------------+---------+ |Project |Ingestion| |Name |Jerry | |Cost_To_Project|1000 | |Project |Ingestion| |Name |Arya | |Cost_To_Project|2000 | |Project |Ingestion| |Name |Emily | |Cost_To_Project|3000 | |Project |ML | |Name |Riley | |Cost_To_Project|9000 | |Project |ML | |Name |Patrick | |Cost_To_Project|1000 | |Project |ML | |Name |Mickey | |Cost_To_Project|8000 | |Project |Analytics| |Name |Donald | +---------------+---------+
Create a map of all columns
For each row in the DataFrame we will create a map of each column in the row with column name as the key and value of the column as value.
var colAndVals = df.columns.flatMap { c => Array(lit(c), col(c)) }
scala> df.withColumn("myMap", map(colAndVals:_*)).show(false)
+---------+-------+---------------+---------------------------------------------------------------+
|Project |Name |Cost_To_Project|myMap |
+---------+-------+---------------+---------------------------------------------------------------+
|Ingestion|Jerry |1000 |[Project -> Ingestion, Name -> Jerry, Cost_To_Project -> 1000] |
|Ingestion|Arya |2000 |[Project -> Ingestion, Name -> Arya, Cost_To_Project -> 2000] |
|Ingestion|Emily |3000 |[Project -> Ingestion, Name -> Emily, Cost_To_Project -> 3000] |
|ML |Riley |9000 |[Project -> ML, Name -> Riley, Cost_To_Project -> 9000] |
|ML |Patrick|1000 |[Project -> ML, Name -> Patrick, Cost_To_Project -> 1000] |
|ML |Mickey |8000 |[Project -> ML, Name -> Mickey, Cost_To_Project -> 8000] |
|Analytics|Donald |1000 |[Project -> Analytics, Name -> Donald, Cost_To_Project -> 1000]|
|Ingestion|John |1000 |[Project -> Ingestion, Name -> John, Cost_To_Project -> 1000] |
|Analytics|Emily |8000 |[Project -> Analytics, Name -> Emily, Cost_To_Project -> 8000] |
|Analytics|Arya |10000 |[Project -> Analytics, Name -> Arya, Cost_To_Project -> 10000] |
|BI |Mickey |12000 |[Project -> BI, Name -> Mickey, Cost_To_Project -> 12000] |
|BI |Martin |5000 |[Project -> BI, Name -> Martin, Cost_To_Project -> 5000] |
+---------+-------+---------------+---------------------------------------------------------------+
Explode the map
Explode the map once a map is created for each row.
df.withColumn("myMap", map(colAndVals:_*)).select(explode($"myMap")).show(false)
+---------------+---------+
|key |value |
+---------------+---------+
|Project |Ingestion|
|Name |Jerry |
|Cost_To_Project|1000 |
|Project |Ingestion|
|Name |Arya |
|Cost_To_Project|2000 |
|Project |Ingestion|
|Name |Emily |
|Cost_To_Project|3000 |
|Project |ML |
|Name |Riley |
|Cost_To_Project|9000 |
|Project |ML |
|Name |Patrick |
|Cost_To_Project|1000 |
|Project |ML |
|Name |Mickey |
|Cost_To_Project|8000 |
|Project |Analytics|
|Name |Donald |
+---------------+---------+
Full code
val data = Seq(
("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000))
import spark.sqlContext.implicits._
val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()
var colAndVals = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.withColumn("myMap", map(colAndVals:_*)).select(explode($"myMap")).toDF("Columns","Values").show(false)


1 Comment
[…] The Big Data in Real World team makes a pivot: […]