Hive Data Manipulation – Loading Data to Hive Tables

By now, we have created different types of tables with a variety of data types in the Hive. But we have not seen how we can populate these tables with data either from files in our local system or some already existing data in HDFS. In today’s tutorial, we will learn about different ways in which we can fill data into Hive tables.

Loading Local System Data to HDFS Table

For this tutorial, consider simple table below which does not have any data in it now.

Using HDFS command below, we can copy data from our local system into a Hive table. We can either specify OVERWRITE which clears existing data in table and loads new data or we can omit OVERWRITE which appends new data to existing table data. Let us see it in action below.

Here we can see that the employee table has been loaded with data from our local file system. We can also try the same same query without the OVERWRITE clause.

Loading Data From HDFS Location into Hive Table

We can use the same command as above to load data from HDFS location to Hive table. We only have to remove the “LOCAL” keyword from command. There is another thing to note while loading data from HDFS location to hive table.

When we load data from Local system to Hive table LOAD DATA command copies(copy+paste) file from system to HDFS. But when we use LOAD DATA with HDFS file location then it moves (cut + paste)data from source location to target location.

Hive Facts

In the above example, we can see that the LOAD data command has copied the file from our HDFS location into the hive table and also deleted that file from the original location.

Loading Data From One Table to Another Table in Hive

We can also load data from one hive table to another. you may have one large table with all kinds of information and you need very specific data from it. In that case you can use the INSERT query like below to add data to your table.

Insert command will trigger a map-reduce job and insert data from a source table to a destination table. Just like the LOAD DATA command we can use the INTO clause instead of OVERWRITE, which will append data to an existing table.

We do not need to create a table beforehand to insert data into it. We can use the hive’s Create Table as a syntax to create table on the fly and insert data into it.

Multi Insert Command

In hive we can use multiple insert commands in a single query. This is useful when we want to scan the entire table once and divide it into smaller set of tables in one single query.

Consider the example, you have large tables with employees and you want to create a separate table for each state which has all employees of that state. We can do it with where clause and adding employees to the respective table. However this will cause our large table to be scanned multiple times. Instead of that, we can use the query below to process this requirement efficiently.

This feature is quite powerful as we can populate multiple tables in a single query. When we will learn about Partitions in the Hive, this feature will be even more useful to dynamically create partitions and fill them with proper data.

Conclusion

In this article, we have learned how to insert data into hive tables using different ways. I hope you have found this useful. See you in the next blog.

Similar Posts

Leave a Reply

Your email address will not be published.