Hadoop Archives (HAR)

Hadoop Archives (HAR) offers an effective way to deal with the small files problem. This post will explain:

  1. The problem with small files
  2. What is HAR?
  3. Limitations of HAR files

The problem with small files

Hadoop works best with big files; small files are handled inefficiently in HDFS. The Namenode holds metadata in memory for every file stored in HDFS. Say we have a 1 GB file in HDFS: the Namenode will store metadata for that file, such as the file name, owner, creation timestamp, block locations and permissions.

Now assume we decide to split this 1 GB file into 1,000 pieces and store all 1,000 “small” files in HDFS. The Namenode now has to hold metadata for 1,000 files in memory. This is not very efficient: first, it takes up a lot of memory, and second, the Namenode soon becomes a bottleneck as it tries to manage metadata for so many objects.
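To get a feel for the overhead, here is a rough back-of-the-envelope estimate. A commonly cited rule of thumb is that each object on the Namenode (file, directory or block) consumes roughly 150 bytes of heap; the exact number varies by Hadoop version, so treat these figures as illustrative only.

1 file of 1 GB (128 MB blocks) : 1 file + 8 blocks          =     9 objects ≈ 1.35 KB of heap
1,000 files of 1 MB each       : 1,000 files + 1,000 blocks = 2,000 objects ≈  300 KB of heap

The same 1 GB of data costs over 200 times more Namenode memory when stored as small files.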

What is HAR?

Hadoop Archives, or HAR, is an archiving facility that packs files into HDFS blocks efficiently, and hence HAR can be used to tackle the small files problem in Hadoop. A HAR is created from a collection of files, and the archiving tool (a simple command) runs a MapReduce job to process the input files in parallel and create the archive file.

HAR command

hadoop archive -archiveName myhar.har -p /input/location /output/location
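The -p flag names the parent path relative to which the source paths are resolved, and you can list several source directories under that parent. For example (the directory names here are illustrative):

hadoop archive -archiveName myhar.har -p /user/hadoop dir1 dir2 /user/outputdir

This archives /user/hadoop/dir1 and /user/hadoop/dir2 into myhar.har under /user/outputdir.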

Once a .har file is created, you can do a listing on it and you will see that it is made up of index files and part files. Part files are nothing but the original files concatenated together into a big file. Index files are lookup files that are used to locate the individual small files inside the big part files.

hadoop fs -ls /output/location/myhar.har

/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-0
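The archive is also exposed through the har:// filesystem scheme, so clients can list and read the archived files as if they were ordinary files in HDFS (the file name below is illustrative):

hadoop fs -ls har:///output/location/myhar.har
hadoop fs -cat har:///output/location/myhar.har/somefile.txt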

Limitations of HAR files

  1. Once an archive file is created, you cannot update it to add or remove files. In other words, HAR files are immutable.
  2. The archive keeps a copy of all the original files, so once a .har is created it takes as much space as the originals. Don’t mistake .har files for compressed files; HAR is not a compression format.
  3. When a .har file is given as input to a MapReduce job, the small files inside it are still processed individually by separate mappers, which is inefficient (see the example after this list).
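To illustrate the last point: a HAR can be used as a job input through the har:// scheme, but this does not change how the underlying files are split. A minimal sketch, assuming the stock wordcount example that ships with Hadoop and illustrative paths:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount har:///output/location/myhar.har /wordcount/output

Each original small file still produces its own input split, so the job launches roughly one mapper per small file, just as it would without the archive.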