External vs. Internal (Managed) Tables in Hive

Updated On August 11, 2020 | By Mahesh Mogal

Hive has two kinds of tables. Managed (internal) tables are controlled by Hive for both their data and their metadata. External tables are the second kind: Hive controls only the metadata for these tables. In this tutorial we will dive deep to learn more about both types.

Managed Tables in Hive

Hive is (more or less) responsible for the life cycle of these tables. By default, Hive creates managed tables: any table that we do not explicitly declare as external is created as an internal, or managed, table.

When we drop a managed table, Hive deletes not only its metadata but also its data from HDFS. This is why these tables are called managed tables: Hive manages their data as well. Let us see them in action.

hive (maheshmogal)> create table test (id int, name string);
OK
Time taken: 0.792 seconds
hive (maheshmogal)> show create table test;
OK
CREATE TABLE `test`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:8020/apps/hive/warehouse/maheshmogal.db/test' -- location of managed table

Hive keeps a managed table in a sub-directory created under the database directory (check the table location in the output above). We can specify another location for a managed table as well, but this may create confusion later, because Hive still deletes that directory when the table is dropped. So it is advisable to use external tables if we want a non-default location for the table.
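To make the point about non-default locations concrete, here is a minimal sketch. The table name test_custom_loc and its path are hypothetical, chosen only for illustration, and the sketch assumes a setup like the one in this post, where a plain CREATE TABLE yields a managed table. DESCRIBE FORMATTED is a quick way to check both the table type and the location.

```sql
-- Hypothetical table name and path, used only for illustration.
CREATE TABLE test_custom_loc (id INT, name STRING)
LOCATION '/user/maheshmogal/test_custom_loc';

-- In the output, look for the "Table Type:" row (MANAGED_TABLE or EXTERNAL_TABLE)
-- and the "Location:" row, which should show the custom path.
DESCRIBE FORMATTED test_custom_loc;

-- Because the table is managed, dropping it also removes the directory above.
DROP TABLE test_custom_loc;
```

This is the confusing part: the directory sits outside the warehouse, yet it disappears with the table because Hive still treats the table as managed.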

We can verify that when we drop the test table, Hive also deletes the "/apps/hive/warehouse/maheshmogal.db/test" directory along with the table metadata.

hive (maheshmogal)> dfs -ls /apps/hive/warehouse/maheshmogal.db;
Found 1 items
drwxrwxrwx   - maheshmogal hdfs          0 2020-05-26 07:58 /apps/hive/warehouse/maheshmogal.db/test
hive (maheshmogal)> drop table test;
OK
Time taken: 0.625 seconds
hive (maheshmogal)> dfs -ls /apps/hive/warehouse/maheshmogal.db;
hive (maheshmogal)>
-- the "test" directory is deleted once the table is dropped

External Tables in Hive

When we create a table with the EXTERNAL keyword, it tells Hive that it does not own the table data, which usually lives somewhere other than the default location in the database. That is why we normally specify the location with the LOCATION clause when we create an external table. If we omit it, Hive falls back to a default warehouse path.

So what happens when we drop an external table? Hive drops only the metadata for that table and keeps the original data at its location. Because it is an external table, Hive does not assume that it owns the table data and leaves it as it is in HDFS.

We can validate this using the queries below. The HDFS directory is still there even after we have dropped the Hive table from the database.

hive (maheshmogal)> create external table dept_test (id int, name string)
                  > location "/user/maheshmogal/departments";
OK
Time taken: 0.22 seconds
hive (maheshmogal)> dfs -ls /user/maheshmogal/departments
                  > ;
Found 2 items
-rw-r--r--   2 maheshmogal hdfs          0 2020-05-24 23:51 /user/maheshmogal/departments/_SUCCESS
-rw-r--r--   2 maheshmogal hdfs         60 2020-05-24 23:51 /user/maheshmogal/departments/part-m-00000
hive (maheshmogal)> drop table dept_test;
OK
Time taken: 0.41 seconds
hive (maheshmogal)> dfs -ls /user/maheshmogal/departments;
Found 2 items
-rw-r--r--   2 maheshmogal hdfs          0 2020-05-24 23:51 /user/maheshmogal/departments/_SUCCESS
-rw-r--r--   2 maheshmogal hdfs         60 2020-05-24 23:51 /user/maheshmogal/departments/part-m-00000
-- data is still there even after we have dropped the table from Hive
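
Since only the metadata was dropped, the table can be brought back by re-creating its definition over the same directory. The following is a minimal sketch that reuses the dept_test definition and path from the session above. It shows no output, because the result depends on the contents of the files.

```sql
-- Re-create the table definition over the existing directory.
CREATE EXTERNAL TABLE dept_test (id INT, name STRING)
LOCATION '/user/maheshmogal/departments';

-- The files left in HDFS are readable through the table again.
SELECT COUNT(*) FROM dept_test;
```

If the field delimiter in the files does not match the table's row format, individual columns may come back as NULL, but the rows are still read from the same files.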

Conclusion

We have learnt about the two types of tables in Hive. For managed tables, Hive owns the data along with the table metadata. For external tables, Hive owns only the table metadata. External tables add flexibility: dropping the table does not delete the data, and the data can easily be shared by multiple tools operating on HDFS (such as Pig and Spark).

In most cases external tables are the better option, but there are use cases for managed tables as well, such as intermediate temporary tables (their data is deleted with the table once they have served their purpose). I hope you found this article useful. See you in the next one.


