What is the difference between Hive internal tables and external tables? - Big Data In Real World

What is the difference between Hive internal tables and external tables?


Hive keeps metadata about its tables in a relational database such as Derby or MySQL, known as the metastore. This metadata includes the table name, the table structure (columns and types), partition information and the location of the underlying data.
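
To see what Hive has recorded for a table, you can ask it to print the stored metadata. A minimal sketch, assuming a table named sales_staging already exists (the name is hypothetical):

```sql
-- Hypothetical table name; substitute one of your own tables.
-- The output lists the columns, the storage location, and a
-- "Table Type" field: MANAGED_TABLE for internal tables,
-- EXTERNAL_TABLE for external tables.
DESCRIBE FORMATTED sales_staging;
```

The Location and Table Type fields in the output are the quickest way to tell which kind of table you are dealing with.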

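Note that the data behind a table is not stored in the metastore. It lives as files in HDFS (or another storage layer), and the difference between internal and external tables comes down to whether Hive manages the lifecycle of those files.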

Internal tables

All metadata of an internal table (also called a managed table) is managed by Hive.

When an internal table is dropped, Hive deletes the table's data along with its metadata.
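
The behaviour can be sketched in HiveQL as follows. The table name and columns are made up for illustration, and the warehouse path in the comment is the common default, which may differ on your cluster:

```sql
-- No EXTERNAL keyword and no LOCATION clause: Hive creates an internal
-- (managed) table and places its files under the warehouse directory
-- (hive.metastore.warehouse.dir, commonly /user/hive/warehouse).
CREATE TABLE page_views_tmp (
  user_id BIGINT,
  url     STRING,
  view_ts TIMESTAMP
)
STORED AS ORC;

-- Removes the metastore entry AND the table's files.
-- If HDFS trash is enabled, the files typically go to the trash first;
-- DROP TABLE ... PURGE skips the trash.
DROP TABLE page_views_tmp;
```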


External tables

Like internal tables, all metadata of an external table is managed by Hive.

Unlike internal tables, when an external table is dropped, Hive removes only the metadata and leaves the table's data in place.
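
A rough HiveQL sketch of the same lifecycle for an external table. The table name, columns and the /data/page_views path are assumptions for illustration, and the files are assumed to be comma-delimited text:

```sql
-- EXTERNAL plus LOCATION: Hive records the schema and the path,
-- but the files under /data/page_views are not owned by Hive.
CREATE EXTERNAL TABLE page_views_ext (
  user_id BIGINT,
  url     STRING,
  view_ts TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- Removes only the metastore entry.
-- The files under /data/page_views stay where they are.
DROP TABLE page_views_ext;
```

Running the same CREATE EXTERNAL TABLE statement again after the drop should make the existing files queryable again, since nothing was deleted.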


When to use an internal table and when to use an external table?

A good use case for an internal table is holding intermediate data in Hive. When you drop such a table, you also want the data behind it to be removed.

Internal tables also make sense when you drop and recreate tables frequently. In that case you do not want data from tables that no longer exist to keep accumulating.
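
As an illustration of the intermediate-data case, here is a sketch that builds a temporary result as an internal table and cleans it up afterwards. The orders table and its columns are hypothetical:

```sql
-- Hypothetical source table: orders(order_id, customer_id, amount, order_date)
-- CREATE TABLE ... AS SELECT produces an internal (managed) table.
CREATE TABLE daily_totals_tmp
STORED AS ORC
AS
SELECT customer_id,
       order_date,
       SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id, order_date;

-- ... downstream queries read from daily_totals_tmp here ...

-- Dropping the table also removes its files, so nothing is left behind.
DROP TABLE IF EXISTS daily_totals_tmp;
```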

In most cases, external tables make sense. In most real-world scenarios, your Hive table is fed by external processes such as Spark jobs and consumed by applications outside Hive. In such cases, Hive merely holds the metadata while the data is managed by processes outside Hive, so it makes sense to keep the data intact when the Hive table is dropped.

To keep it simple: when in doubt, create an external table.
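
A sketch of that setup, assuming an upstream job (for example a Spark job) writes Parquet files into date-partitioned directories such as /data/events/dt=2021-01-18/. The path, table name and columns are assumptions for illustration:

```sql
-- Hive only describes the data; the upstream job owns the files.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  event_id   STRING,
  user_id    BIGINT,
  event_type STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/data/events';

-- Register partition directories that were written outside Hive,
-- so that queries on the table can see them.
MSCK REPAIR TABLE events;
```

Dropping the events table later would remove only this definition; the Parquet files would remain available to the job that wrote them and to any other consumers.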

Big Data In Real World
