How to merge multiple output files from MapReduce or Spark jobs to one? - Big Data In Real World

When you run a MapReduce or Spark job, the number of output files equals the number of reducers in the MapReduce job, or the number of tasks in the last stage of the Spark job.

Combining these files into one is easy. You could write a small utility script with Linux commands to do it, but Hadoop already ships with a command that does the job.
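For comparison, the Linux-only approach mentioned above might look like the following sketch. It assumes the part files have already been copied to a local directory. The function name merge_local_parts and the part-* naming pattern are illustrative assumptions, not something every job guarantees.

```shell
# Hypothetical helper: concatenate local part files into one file.
# Assumes the job output was already copied to a local directory and
# that the files follow the common part-* naming pattern.
merge_local_parts() {
  src_dir="$1"   # local directory holding the part-* files
  dest="$2"      # path of the merged output file
  # Glob expansion is sorted, so part-00000 comes before part-00001.
  cat "$src_dir"/part-* > "$dest"
}
```

It would be called as merge_local_parts /path/to/local/output /path/to/merged-file, where both paths are placeholders.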

Solution

Use hadoop fs -getmerge to combine multiple output files into one.

hadoop fs -getmerge [-nl] <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. 

Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. 
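To see why -nl can matter, consider part files whose last line has no trailing newline. The sketch below emulates both behaviors on local files with plain shell commands rather than calling Hadoop. The file names and contents are made up for illustration.

```shell
# Emulates the effect of -nl on local files; this does not call Hadoop.
# File names and contents are made up for illustration.
work_dir=$(mktemp -d)
printf 'alpha' > "$work_dir/part-00000"   # no trailing newline
printf 'beta'  > "$work_dir/part-00001"   # no trailing newline

# Without a separator, the last record of one file runs into the
# first record of the next file.
cat "$work_dir"/part-* > "$work_dir/merged-plain"

# Appending a newline after each file, as -nl does, keeps records apart.
for f in "$work_dir"/part-*; do
  cat "$f"
  printf '\n'
done > "$work_dir/merged-nl"
```

In this emulation merged-plain holds the two records fused on a single line, while merged-nl holds them on separate lines.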

Input – directory and output – file

hadoop fs -getmerge -nl /src /home/hirw/big-output-file

Input – set of files and output – file

hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /home/hirw/big-output-file

 

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.
